13 datasets found

Black Friday Sales EDA
kaggle.com
Updated Oct 29, 2022
Cite
Rushikesh Konapure (2022). Black Friday Sales EDA [Dataset]. https://www.kaggle.com/datasets/rishikeshkonapure/black-friday-sales-eda
Dataset updated
Oct 29, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rushikesh Konapure
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset History

A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.

Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

Tasks to perform

The Purchase column is the target variable. Perform univariate and bivariate analysis with respect to Purchase.

"Masked" in a column description means the column has already been converted from categorical values to numerical ones.

The points below are only meant to get you started with the dataset; you do not have to follow them in the same sequence.

DATA PREPROCESSING

Check the basic statistics of the dataset

Check for missing values in the data

Check for unique values in data

Perform EDA

Purchase Distribution

Check for outliers

Analysis by Gender, Marital Status, occupation, occupation vs purchase, purchase by city, purchase by age group, etc

Drop unnecessary fields

Convert categorical data into integers using the map function (e.g. the 'Gender' column)

Missing value treatment

Rename columns

Fill NaN values

Map range variables into integers (e.g. the 'Age' column)

Data Visualisation

Visualize individual columns

Age vs Purchased

Occupation vs Purchased

Productcategory1 vs Purchased

Productcategory2 vs Purchased

Productcategory3 vs Purchased

City category pie chart

Check for more possible plots

All the Best!!
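The preprocessing steps listed above (checking missing values, mapping 'Gender' and 'Age' to integers, filling NaN values) might look roughly like the following pandas sketch. The sample rows, the column name `Product_Category_2`, the category labels, and the choice of 0 as a fill value are all assumptions made for illustration; check them against the actual file before use.

```python
import pandas as pd

# Hypothetical sample rows; the column names and category labels below are
# assumptions, not values taken from the dataset itself.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "Product_Category_2": [None, 6.0, None, 14.0],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Assumed encodings for the categorical and range-valued columns.
GENDER_CODES = {"F": 0, "M": 1}
AGE_CODES = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
             "46-50": 4, "51-55": 5, "55+": 6}


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Gender and Age mapped to integers and NaNs filled."""
    out = frame.copy()
    out["Gender"] = out["Gender"].map(GENDER_CODES)
    out["Age"] = out["Age"].map(AGE_CODES)
    # Treat a missing secondary category as 0 (an assumed placeholder value).
    out["Product_Category_2"] = out["Product_Category_2"].fillna(0).astype(int)
    return out


# Count missing values per column before any treatment.
missing_before = df.isna().sum()

clean = preprocess(df)

# A simple bivariate summary: mean purchase amount per encoded gender.
mean_by_gender = clean.groupby("Gender")["Purchase"].mean()
```

Keeping the encodings in explicit dictionaries makes the mapping easy to audit and to reverse when labelling plots.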
A global database of long-term changes in insect assemblages
knb.ecoinformatics.org
search-dev.test.dataone.org
+4 more
Updated Jan 26, 2022
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2022). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F1ZC817H
Unique identifier
https://doi.org/10.5063/F1ZC817H
Dataset updated
Jan 26, 2022
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
Time period covered
Jan 1, 1925 - Jan 1, 2018
Area covered
Pacific Ocean, North Pacific Ocean
Variables measured
End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 63 more
Description
UPDATED on October 15 2020: After some mistakes in some of the data were found, we updated this data set. The changes to the data are detailed on Zenodo (http://doi.org/10.5281/zenodo.4061807), and an Erratum has been submitted.

This data set under CC-BY license contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 165 data sources, representing a total of 1668 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

In the table 'DataSources', descriptive data is provided at the dataset level: Links are provided to online repositories where the original data can be found, it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and describes the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData' more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching to the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
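The linking step described above (joining 'InsectAbundanceBiomassData' to 'PlotData' by 'Plot_ID' and to 'DataSources' by 'DataSource_ID') can be sketched in pandas as follows. The three small frames are hypothetical stand-ins for the CSV files, and the values are invented; only the key column names and 'Year', 'Number', 'Location', and 'Realm' are taken from the description. The real tables may share additional columns, in which case pandas would add suffixes to the duplicates.

```python
import pandas as pd

# Hypothetical stand-ins for the three CSV tables; all values are illustrative.
abundance = pd.DataFrame({
    "DataSource_ID": [1, 1, 1, 2],
    "Plot_ID": [10, 10, 11, 20],
    "Year": [2000, 2001, 2000, 1995],
    "Number": [12.0, None, 7.0, 30.0],   # None stands for the NA years
})
plots = pd.DataFrame({
    "Plot_ID": [10, 11, 20],
    "Location": ["A", "A", "B"],
})
sources = pd.DataFrame({
    "DataSource_ID": [1, 2],
    "Realm": ["Terrestrial", "Freshwater"],
})

# Left joins keep every measurement row; validate raises an error if a key is
# duplicated on the lookup side, which would silently multiply rows.
full = (
    abundance
    .merge(plots, on="Plot_ID", how="left", validate="many_to_one")
    .merge(sources, on="DataSource_ID", how="left", validate="many_to_one")
)

# The NA years are easy to drop when an analysis does not need them.
observed = full.dropna(subset=["Number"])
```

The `validate="many_to_one"` argument is a cheap guard: the joined frame should have exactly as many rows as the measurement table.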
InsectChange: A global database of long-term changes in insect, arachnid and...
knb.ecoinformatics.org
search.dataone.org
+1 more
Updated Oct 2, 2023
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann; Minghua Shen (2023). InsectChange: A global database of long-term changes in insect, arachnid and Entognatha assemblages [Dataset]. https://knb.ecoinformatics.org/view/urn%3Auuid%3A9c946111-05e2-48c9-afb1-2783ee43d0ed
Dataset updated
Oct 2, 2023
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann; Minghua Shen
Time period covered
Jan 1, 1925 - Jan 1, 2020
Area covered
Pacific Ocean, North Pacific Ocean
Variables measured
End, Date, Link, Rank, Year, Class, Error, Genus, Level, Order, and 79 more
Description
UPDATE 2023: New tables have been added and old tables have been updated with new datasets.

Table 'Rawdata 2023.csv' contains the data as extracted from the publications, and as provided by the data owners. This table is linked to the table 'Taxondata 2023' via the column 'Taxon', to the table 'PlotData 2023' via the column 'Plot_ID' and to the table 'SampleData 2023' via the column 'Sample_ID'. This is the data from which the table 'InsectAbundanceBiomassData' was produced, but will not reproduce the exact same table as used in 2020 for the following reasons:

1) some mistakes have been corrected,
2) new datasets were added,
3) it contains other taxa than insects, arachnids and Entognatha: mollusks, worms and some vertebrates are still retained if they were present in the raw data and should be filtered out as needed,
4) other biodiversity metrics than abundance or biomass are included: density (abundance per fixed surface area), richness (number of species), Shannon (Shannon-Wiener index), Pielou (Pielou's evenness index), rarefiedRichness (expected number of species for a fixed number of individuals), ENSPIE (inverted Simpson diversity index = Effective number of species of the Probability of interspecific encounter),
5) some raw data are still missing from this table, because for our newer work the raw data needed rarefaction for equalizing sampling effort.

One study that was obviously incorrect has been removed (Datasource_ID 70).

Table 'Taxondata 2023.csv' contains a taxonomic backbone to resolve all taxa in the table 'rawData 2023' to higher taxonomy. Note that this taxonomy is not corrected for synonyms and taxonomic changes. Nevertheless, the higher taxa are correctly assigned.

This data set under CC-BY license contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data set consists of five linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

In the table 'DataSources', descriptive data is provided at the dataset level: Links are provided to online repositories where the original data can be found, it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and describes the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData' more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching to the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
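Because the 2023 raw table retains non-target taxa and extra biodiversity metrics that "should be filtered out as needed", a filtering step along the following lines may be useful. This is only a sketch: the toy rows are invented, the use of a 'Class' column in the taxon table and the exact spellings of the class names and 'Metric' values are assumptions, and the real files should be inspected before relying on any of them.

```python
import pandas as pd

# Hypothetical miniature versions of 'Rawdata 2023' and 'Taxondata 2023'.
raw = pd.DataFrame({
    "Taxon": ["Carabidae", "Lumbricidae", "Araneae", "Carabidae"],
    "Metric": ["abundance", "abundance", "richness", "biomass"],
    "Number": [5.0, 3.0, 8.0, 1.2],
})
taxa = pd.DataFrame({
    "Taxon": ["Carabidae", "Lumbricidae", "Araneae"],
    "Class": ["Insecta", "Clitellata", "Arachnida"],   # assumed column name
})

# Assumed labels for the groups and metrics to retain.
TARGET_CLASSES = ["Insecta", "Arachnida", "Entognatha"]
TARGET_METRICS = ["abundance", "biomass"]

# Resolve each raw record to its higher taxonomy via the 'Taxon' column.
merged = raw.merge(taxa, on="Taxon", how="left", validate="many_to_one")

# Keep only target groups measured as abundance or biomass.
keep = merged["Class"].isin(TARGET_CLASSES) & merged["Metric"].isin(TARGET_METRICS)
filtered = merged[keep].reset_index(drop=True)
```

Rows whose taxon is absent from the backbone get a missing 'Class' after the left join and are therefore dropped by the filter; counting them first is a sensible check.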
Bangladesh rape case dataset
dataverse.harvard.edu
Updated Nov 9, 2024
Cite
Sajratul Yakin Rubaiat; Sheherjan Haq (2024). Bangladesh rape case dataset [Dataset]. http://doi.org/10.7910/DVN/OE7NFR
Unique identifier
https://doi.org/10.7910/DVN/OE7NFR
Dataset updated
Nov 9, 2024
Dataset provided by
Harvard Dataverse
Authors
Sajratul Yakin Rubaiat; Sheherjan Haq
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
Bangladesh
Description
The "Bangladesh Rape Cases Data" dataset contains detailed information on rape cases reported in various districts of Bangladesh. This dataset is valuable for analyzing trends, patterns, and regional distributions of reported rape cases over a decade. It can be utilized by researchers, policymakers, and social scientists to study and address the issue of rape in Bangladesh.

Total Sample Size: This dataset comprises a total of 2,813 rows, each representing an individual case reported in various districts across Bangladesh.

Data Description:

headline (String): The headline of the news article reporting the rape case. It provides a brief summary of the incident.

district-tag (String): The district where the incident occurred. This helps in identifying the geographical distribution of the cases.

division-tag (String): The division of Bangladesh to which the district belongs. This is useful for broader regional analysis.

subdistrict-tag (String): The specific subdistrict or locality within the district where the incident occurred. This column may contain missing values if the subdistrict is not specified.

id (String, UUID format): A unique identifier for each news article, ensuring that each entry can be distinctly referenced.

url (String): The web link to the original news article, allowing users to access the full report for more detailed information.

last-published-at (DateTime): The date and time when the news article was last published, helping to understand the timeline of the reported cases.

offset (Integer): An offset value for the article, potentially indicating its position in a larger dataset or the order of processing.

content (String): The main content of the news article, providing detailed information about the incident.

Temporal Coverage:

Minimum Date: February 22, 2013

Maximum Date: April 10, 2023

The dataset spans over a decade, allowing for a comprehensive temporal analysis of the reported cases.

Potential Uses:

Trend Analysis: Analyze how the frequency of reported cases changes over time.

Geographical Analysis: Identify regions with higher or lower reporting rates.

Content Analysis: Examine the language and details provided in the headlines and content to understand the nature of reporting.

Correlation Studies: Investigate possible correlations between reported cases and other socio-economic factors.

Data Quality and Considerations:

Missing Values: Some columns, such as subdistrict-tag, may contain missing values where specific information was not provided.

Data Source: The data is sourced from news articles, so it may be influenced by reporting biases and the availability of news coverage.
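The trend and geographical analyses suggested above could start from simple counts of articles per year and per division. The sketch below uses four invented rows; the column names come from the description, but the ISO timestamp format of 'last-published-at' is an assumption (the real file may store it differently). Because the source is news coverage, these counts describe reporting volume, not incidence.

```python
import pandas as pd

# Invented rows using two of the documented columns; values are illustrative.
articles = pd.DataFrame({
    "division-tag": ["Dhaka", "Dhaka", "Khulna", "Dhaka"],
    "last-published-at": [
        "2013-02-22T10:00:00",
        "2013-08-01T09:30:00",
        "2020-05-17T12:00:00",
        "2023-04-10T08:15:00",
    ],
})

# Parse the timestamp strings (assumed ISO 8601) into datetimes.
articles["published"] = pd.to_datetime(articles["last-published-at"])

# Trend analysis: number of articles per calendar year.
per_year = articles.groupby(articles["published"].dt.year).size()

# Geographical analysis: number of articles per division.
per_division = articles["division-tag"].value_counts()

# Overall temporal coverage of the sample.
coverage = (articles["published"].min(), articles["published"].max())
```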
A global database of long-term changes in insect assemblages
knb.ecoinformatics.org
Updated Oct 1, 2020
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2020). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F11V5C9V
Unique identifier
https://doi.org/10.5063/F11V5C9V
Dataset updated
Oct 1, 2020
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
Time period covered
Jan 1, 1925 - Jan 1, 2018
Area covered
Pacific Ocean, North Pacific Ocean
Variables measured
End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 62 more
Description
This data set under CC-BY license contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 166 data sources, representing a total of 1676 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

In the table 'DataSources', descriptive data is provided at the dataset level: Links are provided to online repositories where the original data can be found, it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and describes the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

In the table 'PlotData' more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching to the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
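To make the WARNING concrete, the sketch below estimates a temporal slope separately for each plot and discards the intercept, so nothing is compared among locations. It is a plain least-squares fit on log10(Number + 1), which is an assumption chosen for illustration and not the model used in the paper; the two time series are invented.

```python
import numpy as np
import pandas as pd

# Invented time series for two plots; None marks a year without data (NA).
series = pd.DataFrame({
    "Plot_ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Year": [2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003],
    "Number": [99.0, 49.0, None, 9.0, 9.0, 99.0, 999.0, 9999.0],
})


def plot_slope(group: pd.DataFrame) -> float:
    """Least-squares slope of log10(Number + 1) against Year for one plot."""
    g = group.dropna(subset=["Number"])
    # Centre the years so the fit is numerically well conditioned.
    x = (g["Year"] - g["Year"].mean()).to_numpy()
    y = np.log10(g["Number"].to_numpy() + 1.0)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


# One slope per plot; intercepts are deliberately not kept or compared.
slopes = pd.Series(
    {plot_id: plot_slope(group) for plot_id, group in series.groupby("Plot_ID")}
)
```

In this toy example plot 1 declines and plot 2 increases by one log10 unit per year; only the sign and size of such within-plot trends are meaningful, not the difference in counts between the two plots.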
A global database of long-term changes in insect assemblages
data.nceas.ucsb.edu
Updated Mar 31, 2020
Cite
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2020). A global database of long-term changes in insect assemblages [Dataset]. https://data.nceas.ucsb.edu/view/urn%3Auuid%3A596c7d3c-f22b-489f-91a5-fddc03a1f481
Dataset updated
Mar 31, 2020
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
Time period covered
Jan 1, 1925 - Jan 1, 2018
Area covered
Pacific Ocean, North Pacific Ocean
Variables measured
End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 62 more
Description
This data set under CC-BY license contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 166 data sources representing a total of 1676 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. An additional table presents all references to the data sources, and, if applicable, the open access license under which these are published. When using (parts of) this data set, please respect the original access licenses.

This data set underlies all analyses performed in 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes. Tables for calculating trends of specific taxa and for species richness will be added as they become available.

The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

In the table 'DataSources', descriptive data is provided at the dataset level: Links are provided to online repositories where the original data can be found, it describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and describes the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References'.

In the table 'PlotData' more details on each site within a dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and if there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching to the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses (except for column 'Stratum', which is derived from table 'SampleData'). Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
Data from: Multigenerational exposure to increased temperature reduces...
datadryad.org
data.niaid.nih.gov
+1 more
zip
Updated May 3, 2022
Cite
Emma Moffett; David Fryxell; Kevin Simon (2022). Multigenerational exposure to increased temperature reduces metabolic rate but increases boldness in Gambusia affinis [Dataset]. http://doi.org/10.7280/D1MT39
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.7280/D1MT39
Dataset updated
May 3, 2022
Dataset provided by
Dryad
Authors
Emma Moffett; David Fryxell; Kevin Simon
Time period covered
2022
Description
Males have no data in the pregnancy column.

Individuals that did not leave the refuge were not measured for activity and have no data in the "Time_spent_exploring_s" column. NAs in the dataset indicate missing data.

Column descriptions/ units:

Site refers to the location where Gambusia were collected

Geothermal refers to the designation of sites as geothermal or ambient: geothermal sites received warm water inputs, and ambient sites experience daily and seasonal changes in environmental temperature.

Source_Temp refers to the temperature of the site at the time of fish collection in degrees Celsius.

Temp_lab refers to the temperature at which fish were acclimated in the laboratory in degrees Celsius.

Run refers to the order in which metabolic rate was measured on fish

Fish_ID is a unique identifier for each fish in this study.

Sex refers to the sex of the fish measurements were done on (M = male, F = female).

Pregnancy describes if the fish was visibly pregnant at the ti...
Dissolved Inorganic Carbon and Dissolved Organic Carbon Data for the East...
osti.gov
knb.ecoinformatics.org
+2 more
Updated Jan 1, 2024
+ more versions
Cite
Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States) (2024). Dissolved Inorganic Carbon and Dissolved Organic Carbon Data for the East River Watershed, Colorado (2015-2023) [Dataset]. http://doi.org/10.15485/1660459
Explore at:
Unique identifier
https://doi.org/10.15485/1660459
Dataset updated
Jan 1, 2024
Dataset provided by
Office of Science (http://www.er.doe.gov/)
Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)
Watershed Function SFA
Description
This data package contains mean values for dissolved organic carbon (DOC) and dissolved inorganic carbon (DIC) for water samples taken from the East River Watershed in Colorado. The East River is part of the Watershed Function Scientific Focus Area (WFSFA) located in the Upper Colorado River Basin, United States. DOC and DIC concentrations in water samples were determined using a TOC-VCPH analyzer (Shimadzu Corporation, Japan). DOC was analyzed as non-purgeable organic carbon (NPOC) by purging HCl-acidified samples with carbon-free air to remove DIC prior to measurement. After the acidified sample has been sparged, it is injected into a combustion tube filled with oxidation catalyst heated to 680 degrees C. The DOC in samples is combusted to CO2 and measured by a non-dispersive infrared (NDIR) detector. The peak area of the analog signal produced by the NDIR detector is proportional to the DOC concentration of the sample. DIC was determined by first acidifying the samples with HCl and then purging them with carbon-free air to release CO2 for analysis by the NDIR detector. All files are labeled by location and variable, and the data reported are the mean values of a minimum of three replicate measurements with a relative standard deviation < 3%. All samples were analyzed under a rigorous quality assurance and quality control (QA/QC) process as detailed in the methods.
This data package contains (1) a zip file (dic_npoc_data_2015-2023.zip) containing a total of 323 files: 322 data files of DIC and NPOC data from across the Lawrence Berkeley National Laboratory (LBNL) Watershed Function Scientific Focus Area (SFA), reported in .csv files per location, and a locations.csv (1 file) with latitude and longitude for each location; (2) a file-level metadata (v4_20240311_flmd.csv) file that lists each file contained in the dataset with associated metadata; (3) a data dictionary (v4_20240311_dd.csv) file that contains terms/column_headers used throughout the files along with a definition, units, and data type; and (4) PDF and docx files for the determination of Method Detection Limits (MDLs) for DIC and NPOC data, which were updated in 2024-03. Missing values within the anion data files are noted as either "-9999" or "0.0" for not detectable (N.D.) data. There are a total of 107 locations containing DIC/NPOC data.
Update on 2020-10-07: Updated the data files to remove times from the timestamps, so that only dates remain. The data values have not changed.
Update on 2021-04-11: Added Determination of Method Detection Limits (MDLs) for DIC, NPOC and TDN Analyses document, which can be accessed as a PDF or with Microsoft Word.
Update on 2022-06-10: versioned updates to this dataset were made along with these changes: (1) updated dissolved inorganic carbon and dissolved organic carbon data for all locations up to 2021-12-31, (2) removal of units from column headers in datafiles, (3) added row underneath headers to contain units of variables, (4) restructure of units to comply with CSV reporting format requirements, (5) added -9999 for empty numerical cells, and (6) the file-level metadata (flmd.csv) and data dictionary (dd.csv) were added to comply with the File-Level Metadata Reporting Format.
Update on 2022-09-09: Updates were made to reporting format specific files (file-level metadata and data dictionary) to correct swapped file names, add additional details on metadata descriptions on both files, add a header_row column to enable parsing, and add version number and date to file names (v2_20220909_flmd.csv and v2_20220909_dd.csv).
Update on 2023-08-08: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-01-05. The file level metadata and data dictionary files were updated to reflect the additional data added.
Update on 2024-03-11: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-11-21. Further, revisions to the data files were made to remove incorrect data points (from 1970 and 2001). The reporting format specific files were updated to reflect the additional data added. Revised versions of the PDF and docx files for determination of MDLs for DIC and NPOC were added to replace previous versions.
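The file layout described above (a header row, a units row underneath it, -9999 for empty numerical cells and 0.0 for not-detectable values) can be read with pandas roughly as follows. This is a sketch under assumptions: the column names and values are invented for illustration, and the real names should be taken from the data dictionary file.

```python
import io

import pandas as pd

# Hypothetical file content: header row, units row, then data rows.
raw = io.StringIO(
    "utc_time,doc_mg_per_l\n"
    "yyyy-mm-dd,mg/L\n"
    "2016-05-01,1.25\n"
    "2016-05-08,-9999\n"
    "2016-05-15,0.0\n"
)

df = pd.read_csv(
    raw,
    skiprows=[1],                    # skip the units row under the header
    na_values=["-9999", "-9999.0"],  # sentinel for empty numerical cells
    parse_dates=["utc_time"],
)

# 0.0 marks "not detectable"; flag it instead of silently treating it as a value.
df["not_detected"] = df["doc_mg_per_l"] == 0.0
print(df)
```

Flagging the 0.0 rows rather than dropping them is a design choice made here for illustration; whether to exclude them depends on the analysis.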
Anion Data for the East River Watershed, Colorado (2014-2023)
dataone.org
osti.gov
Updated May 23, 2024
+ more versions
Cite
Kenneth Williams; Curtis Beutler; Wendy Brown; Alexander Newman; Dylan O'Ryan; Roelof Versteeg (2024). Anion Data for the East River Watershed, Colorado (2014-2023) [Dataset]. https://dataone.org/datasets/ess-dive-7ebb5798646b4de-20240523T172944773
Explore at:
Dataset updated
May 23, 2024
Dataset provided by
ESS-DIVE
Authors
Kenneth Williams; Curtis Beutler; Wendy Brown; Alexander Newman; Dylan O'Ryan; Roelof Versteeg
Time period covered
May 2, 2014 - Sep 11, 2023
Description
The anion data for the East River Watershed, Colorado, consists of fluoride, chloride, sulfate, nitrate, and phosphate concentrations collected at multiple, long-term monitoring sites that include stream, groundwater, and spring sampling locations. These locations represent important and/or unique end-member locations for which solute concentrations can be diagnostic of the connection between terrestrial and aquatic systems. Such locations include drainages underlain entirely or largely by shale bedrock, land cover dominated by conifers, aspens, or meadows, and drainages impacted by historic mining activity and the presence of naturally mineralized rock. Developing a long-term record of solute concentrations from a diversity of environments is a critical component of quantifying the impacts of both climate change and discrete climate perturbations, such as drought, forest mortality, and wildfire, on the riverine export of multiple anionic species. Such data may be combined with stream gauging stations co-located at each monitoring site to directly quantify the seasonal and annual mass flux of these anionic species out of the watershed.
This data package contains (1) a zip file (anion_data_2014-2023.zip) containing a total of 346 files: 345 data files of anion data from across the Lawrence Berkeley National Laboratory (LBNL) Watershed Function Scientific Focus Area (SFA), reported in .csv files per location, and a locations.csv (1 file) with latitude and longitude for each location; (2) a file-level metadata (v5_20240311_flmd.csv) file that lists each file contained in the dataset with associated metadata; and (3) a data dictionary (v5_20240311_dd.csv) file that contains terms/column_headers used throughout the files along with a definition, units, and data type. Missing values within the anion data files are noted as either "-9999" or "0.0" for not detectable (N.D.) data. There are a total of 39 locations containing anion data.
Update on 2022-06-10: versioned updates to this dataset were made along with these changes: (1) updated anion data for all locations up to 2021-12-31, (2) removal of units from column headers in datafiles, (3) added row underneath headers to contain units of variables, (4) restructure of units to comply with CSV reporting format requirements, and (5) the file-level metadata (flmd.csv) and data dictionary (dd.csv) were added to comply with the File-Level Metadata Reporting Format.
Update on 2022-09-09: Updates were made to reporting format specific files (file-level metadata and data dictionary) to correct swapped file names, add additional details on metadata descriptions on both files, add a header_row column to enable parsing, and add version number and date to file names (v2_20220909_flmd.csv and v2_20220909_dd.csv).
Update on 2022-12-20: Updates were made to both the data files and reporting format specific files. Conversion issues affecting ER-PLM locations for anion data were resolved for the data files. Additionally, the flmd and dd files were updated to reflect the updated versions of these files. Available data was added up until 2022-03-14.
Update on 2023-08-08: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-05-19. The file level metadata and data dictionary files were updated to reflect the additional data added.
Update on 2024-03-11: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-09-11. Further, revisions to the data files were made to remove incorrect data points (from 1970 and 2001). The reporting format specific files were updated to reflect the additional data added.
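The mass-flux idea mentioned in the description (combining solute concentrations with co-located stream gauging) reduces to concentration times discharge. The sketch below assumes concentrations in mg/L and discharge in m^3/s; both units are assumptions for illustration and should be checked against the data dictionary, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
MISSING = -9999.0  # sentinel used for missing values in the data files


def daily_mass_flux_kg(concentration_mg_per_l, discharge_m3_per_s):
    """Return the solute mass flux in kg/day, or None if the sample is missing.

    1 mg/L equals 1 g/m^3, so concentration * discharge is a flux in g/s.
    """
    if concentration_mg_per_l == MISSING:
        return None
    grams_per_second = concentration_mg_per_l * discharge_m3_per_s
    return grams_per_second * SECONDS_PER_DAY / 1000.0


# Example with made-up numbers: 20 mg/L sulfate at 2.5 m^3/s.
print(daily_mass_flux_kg(20.0, 2.5))
```

Seasonal or annual totals would then follow from summing daily values over the period of interest, with some interpolation between sampling dates that is not shown here.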
Datasets of Suspended Sediment Concentration and Percent Fines (1973–2021),...
datasets.ai
data.usgs.gov
+1 more
55
Updated Dec 28, 2022
+ more versions
Cite
Department of the Interior (2022). Datasets of Suspended Sediment Concentration and Percent Fines (1973–2021), Sampling Information (1973–2021), and Daily Streamflow (1928–2021) for Sites in the Lower Mississippi and Atchafalaya Rivers to Support Analyses of Sediment Transport and Delivery [Dataset]. https://datasets.ai/datasets/datasets-of-suspended-sediment-concentration-and-percent-fines-19732021-sampling-informati
Explore at:
55Available download formats
Dataset updated
Dec 28, 2022
Dataset authored and provided by
Department of the Interior
Description
Datasets of suspended sediment concentration and percent fines, sampling information, and daily streamflow data were compiled and harmonized for 16 sites to better understand sediment transport and delivery in the Lower Mississippi and Atchafalaya Rivers. The compiled data were harmonized by removing unnecessary columns, screening data for laboratory or sampling issues, creating consistent entries for character columns, and dropping irrelevant data, among other steps. Fourteen of the sites are in the Lower Mississippi-Atchafalaya River Basin with two additional sites on the Middle Mississippi and Ohio Rivers. Suspended sediment concentration (total, all size fractions) and percent fines for multiple size fractions were retrieved from the U.S. Geological Survey (USGS) National Water Information System (NWIS) database. These data were matched to related sampling information, such as sampler type and sampling method, also retrieved from NWIS. Continuous daily streamflow was compiled (or estimated where missing) for all sites and these data were from NWIS and the U.S. Army Corps of Engineers (USACE). Daily streamflow records extend as far back as possible and contain no gaps, whereas suspended sediment data and sampling information were measured and reported periodically and may contain multiyear gaps depending on the site. Note, siteIndex is used as the main sediment site identifier since sediment records from more than one USGS site are combined for at least one siteIndex. Additionally, gageIndex is used as the main streamgage identifier since streamflow records from multiple streamgages are sometimes combined for a single gageIndex and a single gageIndex may be used for more than one siteIndex. See the siteTable.csv for linkages between siteIndex, gageIndex, and USGS and USACE site/streamgage numbers.
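The siteIndex/gageIndex linkage described above can be sketched as a two-step join through the site table. The identifiers 'siteIndex' and 'gageIndex' come from the description; all other column names and all values are hypothetical placeholders rather than the dataset's actual schema.

```python
import pandas as pd

# Stand-in for siteTable.csv: one gageIndex may serve more than one siteIndex.
site_table = pd.DataFrame({"siteIndex": ["S1", "S2"], "gageIndex": ["G1", "G1"]})

# Stand-ins for the sediment and daily streamflow tables (invented columns/values).
sediment = pd.DataFrame(
    {
        "siteIndex": ["S1", "S2"],
        "date": pd.to_datetime(["2020-03-01", "2020-03-01"]),
        "ssc_mg_per_l": [210.0, 180.0],
    }
)
streamflow = pd.DataFrame(
    {
        "gageIndex": ["G1", "G1"],
        "date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
        "flow_cfs": [500000.0, 510000.0],
    }
)

# First attach the gage to each sediment sample, then look up that day's flow.
linked = (
    sediment
    .merge(site_table, on="siteIndex", how="left", validate="many_to_one")
    .merge(streamflow, on=["gageIndex", "date"], how="left", validate="many_to_one")
)
print(linked)
```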
Higher Education Institutions in France Dataset
zenodo.org
zip
Updated Mar 3, 2025
Cite
Jackson Barreto; Rodrigo Costa (2025). Higher Education Institutions in France Dataset [Dataset]. http://doi.org/10.5281/zenodo.14960627
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14960627
Dataset updated
Mar 3, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jackson Barreto; Rodrigo Costa
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
France
Description
Higher Education Institutions in France Dataset

This repository contains a dataset of higher education institutions in France. It covers 349 institutions, including universities, universities of applied sciences, and higher institutes such as the Higher Institute of Engineering, the Higher Institute of Biotechnologies, and a few others. This dataset was compiled in response to a cybersecurity investigation of French higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].

Data

The data includes the following fields for each institution:

ETER_Id: A unique identifier assigned to each institution.

Name: The full name of the institution.

Category: Indicates whether the institution is public or private.

Institution_Category_Standardized: Indicates whether the institution is a university, a university of applied sciences, or other.

Member_of_European_University_alliance: Indicates whether the institution is a member of a European University Alliance (a collaborative network of higher education institutions in Europe).

Url: The website of the institution.

NUTS2: Nomenclature of Territorial Units for Statistics (NUTS): A classification by the European Union to divide member states' territories into statistical units. The NUTS system has three hierarchical levels, with NUTS2 being the second level.

NUTS2_Label_2016: Refers to the classification of regions at the NUTS2 level according to the 2016 criteria set by the European Union.

NUTS2_Label_2021: Refers to the classification of regions at the NUTS2 level according to the 2021 criteria set by the European Union.

NUTS3: Nomenclature of Territorial Units for Statistics (NUTS): A classification by the European Union to divide member states' territories into statistical units. The NUTS system has three hierarchical levels, with NUTS3 being the third level.

NUTS3_Label_2016: Refers to the classification of regions at the NUTS3 level according to the 2016 criteria set by the European Union.

NUTS3_Label_2021: Refers to the classification of regions at the NUTS3 level according to the 2021 criteria set by the European Union.

Methodology

The dataset was created from two sources: the European Higher Education Sector Observatory (ETER) [3] and Eurostat's NUTS (Nomenclature of territorial units for statistics) classifications for 2013-16 [4] and 2021 [5]. The data was collected on December 26, 2024.

This section outlines the methodology used to create the dataset for Higher Education Institutions (HEIs) in France. The dataset consolidates information from various sources, processes the data, and enriches it to provide accurate and reliable insights.

Data Sources

ETER Database: The primary dataset was sourced from the ETER database, containing detailed information about HEIs in Europe.

File: eter-export-2021-FR.xlsx

Eurostat NUTS Data: Two datasets from Eurostat were used for regional information:

NUTS 2013-2016 regions: NUTS2013-NUTS2016.xlsx

NUTS 2021 regions: NUTS2021.xlsx

Data Cleaning and Preprocessing

Column Renaming

Columns in the raw dataset were renamed for consistency and readability. Examples include:

ETER ID → ETER_ID

Institution Name → Name

Legal status → Category

Value Replacement

HEI Categories: The Category column was cleaned, with government-dependent institutions classified as "public."

Standardized Institution Categories: Mapped numerical values to descriptive labels such as "University" and "University of applied sciences."

European University Alliance Membership: Replaced binary values with "Yes" or "No."

Handling Missing or Incorrect Data

Specific entries with missing or incorrect data were updated manually based on their ETER_ID. For instance:

Adjusted URLs for entries like FR0333 (updated to www.icam.fr)

Adjusted URLs for entries like FR0906 (updated to epss.fr)

Adjusted URLs for entries like FR0104 (updated to www.ensa-nancy.fr)

Adjusted URLs for entries like FR0466 (updated to www.clermont-auvergne-inp.fr)

Adjusted URLs for entries like FR0907 (updated to insp.gouv.fr). This institution has also changed its name to Institut national du service public.

Removed entries such as FR0129 and FR0944 due to insufficient or invalid information.

Removed FR0513 Institut supérieur européen de gestion Lyon because it shares the URL of the Paris school, so only the main campus in Paris remains.

Removed FR0235 Institut supérieur de l'électronique et du numérique Toulon because it shares the URL of Institut supérieur de l'électronique et du numérique Lille, so only the main campus remains.

Removed FR0106 and FR010 École spéciale militaire because the URL returns 403 Forbidden.

Removed FR0970 École nationale de la meteorologie because of invalid HTTPS.

Regional Data Integration

Merged NUTS 2016 and NUTS 2021 data to enrich the dataset with regional labels.

Final Dataset

The final dataset was saved as a CSV file: france-heis.csv, encoded in UTF-8 for compatibility. It includes detailed information about HEIs in France, their categories, regional affiliations, and membership in European alliances.

Summary

This methodology ensures that the dataset is accurate, consistent, and enriched with valuable regional and institutional details. The final dataset is intended to serve as a reliable resource for analyzing French HEIs.
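The cleaning steps described in the methodology (renaming, value replacement, manual fixes by ETER_ID, removal of entries, and the NUTS merge) might look roughly like this in pandas. It is a sketch, not the authors' script: the raw column names other than "ETER ID", "Institution Name" and "Legal status", the numeric category codes, the raw status string, the example rows and the NUTS labels are all assumptions made for illustration.

```python
import io

import pandas as pd

# Invented stand-in for the ETER export; only three raw column names are from the text.
raw = pd.DataFrame(
    {
        "ETER ID": ["FR0001", "FR0129", "FR0333"],
        "Institution Name": ["Example University", "Example School", "Icam"],
        "Legal status": ["public", "private", "private government-dependent"],
        "Category code": [1, 2, 0],        # hypothetical numeric coding
        "Alliance member": [1, 0, 0],      # hypothetical binary coding
        "Url": ["www.a.example", "", "old.example"],
        "NUTS2": ["FR10", "FRK2", "FRE1"],
    }
)
# Invented stand-in for the Eurostat NUTS 2021 sheet.
nuts_2021 = pd.DataFrame(
    {"NUTS2": ["FR10", "FRE1"], "NUTS2_Label_2021": ["label A", "label B"]}
)

heis = raw.rename(
    columns={
        "ETER ID": "ETER_ID",
        "Institution Name": "Name",
        "Legal status": "Category",
        "Category code": "Institution_Category_Standardized",
        "Alliance member": "Member_of_European_University_alliance",
    }
)

# Value replacement: government-dependent institutions count as public.
heis["Category"] = heis["Category"].replace({"private government-dependent": "public"})
heis["Institution_Category_Standardized"] = heis["Institution_Category_Standardized"].map(
    {1: "University", 2: "University of applied sciences", 0: "Other"}
)
heis["Member_of_European_University_alliance"] = heis[
    "Member_of_European_University_alliance"
].map({1: "Yes", 0: "No"})

# Manual fixes keyed on ETER_ID, as listed in the methodology.
heis.loc[heis["ETER_ID"] == "FR0333", "Url"] = "www.icam.fr"
heis = heis[~heis["ETER_ID"].isin(["FR0129", "FR0944"])]

# Regional data integration: attach NUTS labels.
heis = heis.merge(nuts_2021, on="NUTS2", how="left")

# Writing to a buffer here; for a file, to_csv("france-heis.csv", index=False, encoding="utf-8").
buffer = io.StringIO()
heis.to_csv(buffer, index=False)
print(buffer.getvalue())
```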

Usage

This data is available under the Creative Commons Zero (CC0) license and can be used for any purpose, including academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].

If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI: 10.5281/zenodo.7614862

Contribution

If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.

Acknowledgment

We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Instituto Politécnico de Viana do Castelo, Portugal.

References

[1] Pending

[2] S. Bezjak, A. Clyburne-Sherin, P. Conzett, P. Fernandes, E. Görögh, K. Helbig, B. Kramer, I. Labastida, K. Niemeyer, F. Psomopoulos, T. Ross-Hellauer, R. Schneider, J. Tennant, E. Verbakel, H. Brinken, and L. Heller, Open Science Training Handbook. Zenodo, Apr. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1212496

[3] The European Higher Education Sector Observatory, Dec 2024. Available: ETER

[4] NUTS - Nomenclature of territorial units for statistics, Dec 2024. Available: NUTS-2013-2016

[5] NUTS - Nomenclature of territorial units for statistics, Dec 2024. Available: NUTS-2021
De_Novo_Drug_Design.Bootstrap
huggingface.co
Updated Mar 15, 2025
Cite
jeff (2025). De_Novo_Drug_Design.Bootstrap [Dataset]. https://huggingface.co/datasets/AICanada/De_Novo_Drug_Design.Bootstrap
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 15, 2025
Authors
jeff
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Details of AF3D_pLDDT_PChem3D_Shapes_Energy_Binding_2025_03_14 data: De novo drug design: Energy minimization, also known as geometry optimization, is used to generate new drug molecules from scratch based on the 3D structure of the target receptor. 9329 rows of data are provided out of a total of 14109 rows; columns were filtered to remove null values. Duplicate "PDB_protein_key" mapped to same duplicate "AF 3D AtomicData" (along with all data in rows) due to the many to many… See the full description on the dataset page: https://huggingface.co/datasets/AICanada/De_Novo_Drug_Design.Bootstrap.
Arcangelo Corelli - Sonate a tre (A corpus of annotated scores)
data.niaid.nih.gov
zenodo.org
Updated Sep 8, 2023
Cite
Johannes Hentschel (2023). Arcangelo Corelli - Sonate a tre (A corpus of annotated scores) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7504011
Explore at:
Dataset updated
Sep 8, 2023
Dataset provided by
Johannes Hentschel
Fabian C. Moss
Markus Neuwirth
Martin Rohrmeier
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Arcangelo Corelli - Trio Sonatas (A corpus of annotated scores)

This corpus of annotated MuseScore files has been created within the DCML corpus initiative and employs the DCML harmony annotation standard. It was released together with and as part of the "workflow paper"

Hentschel, J., Moss, F. C., Neuwirth, M., & Rohrmeier, M. A. (2021). A semi-automated workflow paradigm for the distributed creation and curation of expert annotations. Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR, 262–269. https://doi.org/10.5281/ZENODO.5624417

The corpus comprises 36 Sonate a tre, divided into 149 separate movements. Together they make up three of the four famous cycles of 12 trio sonatas each:

Opus  Cycle                Publication  Included
1     12 sonate da chiesa  Rome 1681    Yes
2     12 sonate da camera  Rome 1685    No
3     12 sonate da chiesa  Rome 1689    Yes
4     12 sonate da camera  Rome 1694    Yes

Versions

Version 2.0

TSV files now come with the column quarterbeats, which measures in quarter notes each event's position as its distance from the beginning

Extracted notes now come with the columns name and octave.

Column volta (containing first and second endings) removed from pieces that don't have any.

metadata.tsv has been enriched with further columns, in particular information about each movement's dimensions, including dimensions upon unfolding repeats (for instance, last_mn has the number of measures, last_mn_unfolded the number of measures when playing all repeats)

The folder reviewed contains two files per movement:

A copy of the score where all out-of-label notes have been colored in red; are shown in these files in a diff-like manner (removed in red, added in green).

A copy of the harmonies TSV with six added columns that reflect the coloring of out-of-label notes ("coloring reports")

As long as the ms3 review has any complaints, it stores them in the file warnings.log. Currently, it shows those labels where over 60% of the notes in the segment have been colored in red and probably need revisiting (pull requests welcome).

TSV files are automatically kept up to date using the new GitHub action dcml_corpus_workflow which is the successor of the implementation used in the creation of this dataset.

Version 1.1

This release marks the moment where all 149 movements include a reviewed set of annotations that adhere to version 2.3.0 of the DCML harmony annotation standard. The metadata have not been completed yet and the data were extracted one last time with the now deprecated version 0.4.11 of the MuseScore parser ms3 for matters of completeness and homogeneity. The purpose is mainly to substantiate the claim that the "semi-automated workflow paradigm", as it had been implemented at publication time (see the ISMIR paper cited above), can indeed be put to effective use in the creation of a large dataset. This version is, however, to be followed by a version with upgraded tabular data based on the more mature ms3 > 1.0.0.

Version 1.0

The first release reflects the state of the dataset when finalizing chapter 4 of the workflow paper cited above.

Getting the data

With full version history

The dataset is version-controlled via git. In order to download the files with all revisions they have gone through, git needs to be installed on your machine. Then you can clone this repository using the command

git clone https://github.com/DCMLab/corelli.git

Without full version history

If you are only interested in the current version of the corpus, you can simply download and unpack this ZIP file.

Data Formats

Each piece in this corpus is represented by four files with identical names, each in its own folder. For example, the first movement of the first sonata has the following files:

MS3/op01n01a.mscx: Uncompressed MuseScore file including the music and annotation labels.

notes/op01n01a.tsv: A table of all note heads contained in the score and their relevant features (not each of them represents an onset, some are tied together)

measures/op01n01a.tsv: A table with relevant information about the measures in the score.

harmonies/op01n01a.tsv: A list of the included harmony labels (including cadences and phrases) with their positions in the score.
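The four-files-per-piece layout listed above can be captured in a small helper. The folder names and extensions are taken from the list; the function name and the dictionary keys are hypothetical.

```python
from pathlib import Path

# Folder name and file extension for each representation of a piece.
FOLDERS = {
    "score": ("MS3", ".mscx"),
    "notes": ("notes", ".tsv"),
    "measures": ("measures", ".tsv"),
    "harmonies": ("harmonies", ".tsv"),
}


def piece_files(piece, root=Path(".")):
    """Return the expected path of each of the four files for one movement."""
    return {
        kind: root / folder / f"{piece}{extension}"
        for kind, (folder, extension) in FOLDERS.items()
    }


for kind, path in piece_files("op01n01a").items():
    print(kind, path.as_posix())
```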

Opening Scores

After navigating to your local copy, you can open the scores in the folder MS3 with the free and open source score editor MuseScore. Please note that the scores have been edited, annotated and tested with MuseScore 3.6.2. MuseScore 4 has since been released and preliminary tests suggest that it renders them correctly.

Opening TSV files in a spreadsheet

Tab-separated value (TSV) files are like Comma-separated value (CSV) files and can be opened with most modern text editors. However, for correctly displaying the columns, you might want to use a spreadsheet or an addon for your favourite text editor. When you use a spreadsheet such as Excel, it might annoy you by interpreting fractions as dates. This can be circumvented by using Data --> From Text/CSV or the free alternative LibreOffice Calc. Other than that, TSV data can be loaded with every modern programming language.

Loading TSV files in Python

Since the TSV files contain null values, lists, fractions, and numbers that are to be treated as strings, you may want to use this code to load any TSV files related to this repository (provided you're doing it in Python). After a quick pip install -U ms3 (requires Python 3.10) you'll be able to load any TSV like this:

import ms3

labels = ms3.load_tsv('harmonies/op01n01a.tsv')
notes = ms3.load_tsv('notes/op01n01a.tsv')
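Where installing ms3 is not an option, a rough approximation is possible with pandas alone. The sketch below is not a replacement for ms3.load_tsv: it only shows how reading everything as strings and converting selected columns by hand preserves fractions and null values. The inline sample and its three column names are illustrative assumptions.

```python
import io
from fractions import Fraction

import pandas as pd

# Tiny inline sample; real files have many more columns.
sample = io.StringIO(
    "mc\tmc_onset\tlabel\n"
    "1\t0\tI\n"
    "1\t1/2\tV\n"
    "2\t3/4\t\n"
)

# dtype=str stops pandas from guessing types (e.g. turning "1/2" into something else).
df = pd.read_csv(sample, sep="\t", dtype=str)

# Convert column by column; empty cells stay as missing values.
df["mc"] = df["mc"].astype(int)
df["mc_onset"] = df["mc_onset"].map(Fraction)
print(df)
```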

Column names

You can look up meaning and data type of the columns of all TSV files including metadata.tsv in ms3's documentation (simply search through the page).

Generating all TSV files from the scores

When you have made changes to the scores and want to update the TSV files accordingly, you can use the following command (provided you have pip-installed ms3):

ms3 extract -M -N -X -D # for measures, notes, expanded annotations, and metadata

If, in addition, you want to generate the reviewed scores with out-of-label notes colored in red, you can do

ms3 review -M -N -X -D # for extracting measures, notes, expanded annotations, and metadata

By adding the flag -c to the review command, it will additionally compare the (potentially modified) annotations in the score with the ones currently present in the harmonies TSV files and reflect the comparison in the reviewed scores.

Score origin

To create the dataset we downloaded the musicXML conversion available on Craig Sapp's KernScores (thanks to the engraver(s) who first encoded the scores in **kern format), converted them to MuseScore, and had them corrected and completed by the transcription service tunescribers.com. This involved adding thorough bass figures throughout and engraving a few missing movements from scratch. The commission was performed based on the Pepusch prints available on the International Music Score Library Project (IMSLP) which are included in the folder pdf:

Opus  File                                         IMSLP
1     Corelli op. 1 12 Triosonaten - Partitur.pdf  https://imslp.org/wiki/Special:ReverseLookup/1666
3     Corelli op. 3 12 Triosonaten - Partitur.pdf  https://imslp.org/wiki/Special:ReverseLookup/1689
4     Corelli op. 4 12 Triosonaten - Partitur.pdf  https://imslp.org/wiki/Special:ReverseLookup/1690

(The scan of op. 3 is missing page 46, corresponding to op03n12a)

Whenever pitches, bass figures or their placement were obviously wrong they have been corrected based on the Rome princeps editions.

Caveats

Wrong positions

Two files have different time signatures in the upper and lower staff pairs which leads to wrong positions:

op03n10d has 12/8 vs. 2/2

op04n06g has 12/8 vs. 4/4

Since the parser deals only with one time signature per measure, and since positions are computed additively, the positions are currently incorrect for

all events in these two pieces which

occur in staff 3 or 4

after beat 1.

As a remedy, staves 1 and 2 could be re-written in simple meters (2/2 or 4/4) sporting triplets. For now, users could multiply mc_onset values for staves 3 and 4 by 1.5 as a remedy. The quarterbeats would then need to be re-computed by adding the stretched onset values to the MC's quarterbeat.
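The interim remedy described above (stretching mc_onset by 1.5 for staves 3 and 4 and recomputing quarterbeats) can be sketched as follows. It assumes that mc_onset is expressed in fractions of a whole note and quarterbeats in quarter notes, hence the factor of 4; the function and parameter names are hypothetical.

```python
from fractions import Fraction

STRETCH = Fraction(3, 2)   # 12/8 measures are 1.5 times as long as 2/2 or 4/4
QUARTERS_PER_WHOLE = 4     # assumption: mc_onset is in whole notes


def corrected_position(mc_quarterbeats, mc_onset, staff):
    """Return (corrected mc_onset, corrected quarterbeats) for one event.

    mc_quarterbeats is the quarterbeats value at the start of the event's measure.
    Only staves 3 and 4 are stretched; staves 1 and 2 are returned unchanged.
    """
    if staff in (3, 4):
        mc_onset = mc_onset * STRETCH
    return mc_onset, mc_quarterbeats + mc_onset * QUARTERS_PER_WHOLE


# An event halfway through a 2/2 measure in staff 3, in a measure starting at 0:
print(corrected_position(Fraction(0), Fraction(1, 2), staff=3))
```

Events on beat 1 have an onset of 0 and are therefore unaffected, which matches the statement that only events after beat 1 are mispositioned.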

warnings.log

As long as this file exists, the ms3 review command has detected

incongruent phrase beginnings { and endings }, and/or

harmony labels where over 60 % of the note heads in the segment are out-of-label, and/or

other warnings have come up.

Pull requests addressing any of these warnings would be highly appreciated.

Instruments

The information on the four parts in the MuseScore files has not been curated. That concerns the staff names, brackets, behaviour of barlines, and instruments. If someone could send us a good configuration that looks and sounds decent, we would be glad to automatically apply it to the entire dataset.

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0).

Naming convention

For example, all files starting with op03n02 are movements of Sonata number 2 from opus 3. The sequence of movements is indicated by appended letters op03n02a, op03n02b, etc.
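A minimal sketch of how the naming convention could be parsed in Python is given below. The helper parse_name and the Movement tuple are hypothetical names introduced for illustration; the pattern assumes that every file name starts with "op", the opus number, "n", the sonata number, and a single lowercase movement letter, as in the examples above.

```python
import re
from typing import NamedTuple

# Assumed pattern: op<opus>n<sonata><movement letter>, e.g. op03n02a
FILE_PATTERN = re.compile(r"^op(\d+)n(\d+)([a-z])")


class Movement(NamedTuple):
    opus: int
    sonata: int
    movement: str  # letter indicating the position within the sonata


def parse_name(filename: str) -> Movement:
    """Split a file name such as 'op03n02a' into opus, sonata, and movement."""
    match = FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the naming convention")
    opus, sonata, letter = match.groups()
    return Movement(int(opus), int(sonata), letter)
```

Because the pattern is only anchored at the start, anything following the movement letter (such as a file extension) is ignored.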

Questions, Suggestions, Corrections, Bug Reports

For questions, remarks etc., please create an issue and feel free to fork and submit pull requests.

Rushikesh Konapure (2022). Black Friday Sales EDA [Dataset]. https://www.kaggle.com/datasets/rishikeshkonapure/black-friday-sales-eda

Black Friday Sales EDA

Data Analytics Project

Dataset updated

Oct 29, 2022

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Rushikesh Konapure

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Dataset History

A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.

Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

Tasks to perform

The Purchase column is the target variable. Perform univariate and bivariate analysis with respect to Purchase.

"Masked" in a column description means that the column has already been converted from categorical values to numerical values.

The points below are only meant to get you started with the dataset; you do not have to follow them in the same sequence.

DATA PREPROCESSING

Check the basic statistics of the dataset
Check for missing values in the data
Check for unique values in data
Perform EDA
Purchase Distribution
Check for outliers
Analysis by Gender, Marital Status, occupation, occupation vs purchase, purchase by city, purchase by age group, etc
Drop unnecessary fields
Convert categorical data into integers using the map function (e.g. the 'Gender' column)
Missing value treatment
Rename columns
Fill NaN values
Map range variables into integers (e.g. the 'Age' column)
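
A few of the preprocessing steps above could be sketched with pandas as follows. This is a hedged illustration, not the dataset author's solution: the category labels ("F"/"M" and the age ranges), the default column Product_Category_2, and the fill value 0 are assumptions that should be checked against the actual file, and the 1.5 x IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Assumed labels; verify them against the unique values in the real data.
GENDER_CODES = {"F": 0, "M": 1}
AGE_ORDER = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]
AGE_CODES = {label: code for code, label in enumerate(AGE_ORDER)}


def preprocess(df, fill_columns=("Product_Category_2",)):
    """Return a copy with Gender and Age mapped to integers and NaNs filled."""
    out = df.copy()
    out["Gender"] = out["Gender"].map(GENDER_CODES)
    out["Age"] = out["Age"].map(AGE_CODES)
    for column in fill_columns:  # hypothetical column names
        out[column] = out[column].fillna(0)
    return out


def iqr_outliers(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]
```

Labels that are missing from the mapping dictionaries become NaN after map, so counting NaNs in the mapped columns is a quick way to catch unexpected categories.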

Data Visualisation

Visualize individual columns
Age vs Purchased
Occupation vs Purchased
Productcategory1 vs Purchased
Productcategory2 vs Purchased
Productcategory3 vs Purchased
City category pie chart
Check for other possible plots

All the Best!!
