13 datasets found
  1. Black Friday Sales EDA

    • kaggle.com
    Updated Oct 29, 2022
    Cite
    Rushikesh Konapure (2022). Black Friday Sales EDA [Dataset]. https://www.kaggle.com/datasets/rishikeshkonapure/black-friday-sales-eda
    Dataset updated
    Oct 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rushikesh Konapure
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset History

    A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.

    Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

    Tasks to perform

    The Purchase column is the target variable. Perform univariate and bivariate analysis with respect to Purchase.

    "Masked" in the column description means the column has already been converted from categorical values to numerical ones.

    The points below are only meant to get you started with the dataset; you do not have to follow them in the same sequence.

    DATA PREPROCESSING

    • Check the basic statistics of the dataset

    • Check for missing values in the data

    • Check for unique values in data

    • Perform EDA

    • Purchase Distribution

    • Check for outliers

    • Analysis by Gender, Marital Status, occupation, occupation vs purchase, purchase by city, purchase by age group, etc

    • Drop unnecessary fields

    • Convert categorical data into integers using the map function (e.g. the 'Gender' column)

    • Missing value treatment

    • Rename columns

    • Fill nan values

    • Map range variables into integers (e.g. the 'Age' column)
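
    A few of the preprocessing steps above can be sketched with pandas. This is only an illustration on a tiny made-up frame: the column names other than 'Gender' and 'Age', the category labels, the integer codes and the fill value 0 are all assumptions, not taken from the dataset documentation.

```python
import pandas as pd

# Tiny made-up sample. Column names, labels and values are assumptions
# chosen for illustration; check them against the real file before use.
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "Product_Category_2": [None, 6.0, 14.0, None],
    "Purchase": [8370, 15200, 1422, 7969],
})

# Check for missing values in each column.
missing_counts = df.isna().sum()

# Convert a categorical column into integers with Series.map.
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})

# Map range labels to ordered integer codes (assumed set of age bands).
age_codes = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
             "46-50": 4, "51-55": 5, "55+": 6}
df["Age"] = df["Age"].map(age_codes)

# Fill NaN values. Using 0 as the placeholder is an assumption;
# another sentinel or an imputed value may suit the analysis better.
df["Product_Category_2"] = df["Product_Category_2"].fillna(0).astype(int)
```

    Note that Series.map turns any label missing from the dictionary into NaN, so it is worth re-running the missing-value check after mapping.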

    Data Visualisation

    • Visualize individual columns
    • Age vs Purchased
    • Occupation vs Purchased
    • Productcategory1 vs Purchased
    • Productcategory2 vs Purchased
    • Productcategory3 vs Purchased
    • City category pie chart
    • Check for more possible plots

    All the Best!!

  2. A global database of long-term changes in insect assemblages

    • knb.ecoinformatics.org
    • search-dev.test.dataone.org
    • +4 more
    Updated Jan 26, 2022
    Cite
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2022). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F1ZC817H
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
    Time period covered
    Jan 1, 1925 - Jan 1, 2018
    Area covered
    Pacific Ocean, North Pacific Ocean
    Variables measured
    End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 63 more
    Description

    UPDATED on October 15, 2020: After mistakes were found in some of the data, we updated this data set. The changes to the data are detailed on Zenodo (http://doi.org/10.5281/zenodo.4061807), and an Erratum has been submitted.

    This data set, under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 165 data sources, representing a total of 1668 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

    This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

    The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

    In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found, and the table describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

    In the table 'PlotData', more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and whether there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

    The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

    The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

    Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

    WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
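
    The linking step described above can be sketched with pandas merges. The three small frames below are made-up stand-ins for 'InsectAbundanceBiomassData.csv', 'PlotData.csv' and 'DataSources.csv'; only the key columns 'Plot_ID' and 'DataSource_ID' and the columns 'Number', 'Year', 'Realm' and 'Location' come from the description, and all values are invented. The sketch also assumes the plot table shares no column with the other tables besides 'Plot_ID'.

```python
import pandas as pd

# Made-up stand-ins for the three CSV files; all values are invented.
data_sources = pd.DataFrame({
    "DataSource_ID": [1, 2],
    "Realm": ["Terrestrial", "Freshwater"],
})
plot_data = pd.DataFrame({
    "Plot_ID": [10, 11, 20],
    "Location": ["site A", "site A", "site B"],
})
abundance = pd.DataFrame({
    "Plot_ID": [10, 10, 11, 20],
    "DataSource_ID": [1, 1, 1, 2],
    "Year": [2000, 2001, 2000, 2000],
    "Number": [12.0, None, 7.0, 30.0],  # None plays the role of NA
})

# Link abundance rows to plot details, then to data-source details.
# validate="many_to_one" raises if a key is duplicated on the right side.
full = (
    abundance
    .merge(plot_data, on="Plot_ID", how="left", validate="many_to_one")
    .merge(data_sources, on="DataSource_ID", how="left", validate="many_to_one")
)

# Years with missing data are kept in the table; drop them when not needed.
observed = full.dropna(subset=["Number"])
```

    With the real files, each frame would presumably be read with pd.read_csv, and any column names shared between tables would need suffixes or prior selection.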

  3. InsectChange: A global database of long-term changes in insect, arachnid and...

    • knb.ecoinformatics.org
    • search.dataone.org
    • +1 more
    Updated Oct 2, 2023
    Cite
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann; Minghua Shen (2023). InsectChange: A global database of long-term changes in insect, arachnid and Entognatha assemblages [Dataset]. https://knb.ecoinformatics.org/view/urn%3Auuid%3A9c946111-05e2-48c9-afb1-2783ee43d0ed
    Dataset updated
    Oct 2, 2023
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann; Minghua Shen
    Time period covered
    Jan 1, 1925 - Jan 1, 2020
    Area covered
    Pacific Ocean, North Pacific Ocean
    Variables measured
    End, Date, Link, Rank, Year, Class, Error, Genus, Level, Order, and 79 more
    Description

    UPDATE 2023: New tables have been added, and old tables have been updated with new datasets.

    Table 'Rawdata 2023.csv' contains the data as extracted from the publications, and as provided by the data owners. This table is linked to the table 'Taxondata 2023' via the column 'Taxon', to the table 'PlotData 2023' via the column 'Plot_ID' and to the table 'SampleData 2023' via the column 'Sample_ID'. This is the data from which the table 'InsectAbundanceBiomassData' was produced, but it will not reproduce the exact table used in 2020, for the following reasons: 1) some mistakes have been corrected, 2) new datasets were added, 3) it contains other taxa than insects, arachnids and Entognatha: mollusks, worms and some vertebrates are still retained if they were present in the raw data and should be filtered out as needed, 4) other biodiversity metrics than abundance or biomass are included: density (abundance per fixed surface area), richness (number of species), Shannon (Shannon-Wiener index), Pielou (Pielou's evenness index), rarefiedRichness (expected number of species for a fixed number of individuals), ENSPIE (inverted Simpson diversity index = Effective number of species of the Probability of interspecific encounter), 5) some raw data are still missing from this table, because for our newer work the raw data needed rarefaction for equalizing sampling effort.

    One study that was obviously incorrect has been removed (Datasource_ID 70).

    Table 'Taxondata 2023.csv' contains a taxonomic backbone to resolve all taxa in the table 'rawData 2023' to higher taxonomy. Note that this taxonomy is not corrected for synonyms and taxonomic changes. Nevertheless, the higher taxa are correctly assigned.

    This data set, under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data set consists of five linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

    This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

    The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

    In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found, and the table describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

    In the table 'PlotData', more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and whether there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

    The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

    The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

    Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

    WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).
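
    To illustrate what "studying slopes rather than intercepts" means in practice, the sketch below fits a separate least-squares line per plot and compares only the slopes. It is a simplified stand-in and not the authors' analysis (the description mentions random effects, which this ignores); the plot names and numbers are invented, and the helper name least_squares_slope is hypothetical.

```python
from collections import defaultdict

def least_squares_slope(points):
    """Slope of the ordinary least-squares line through (year, number) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Invented records: (plot, year, number). None stands for a missing year (NA).
records = [
    ("plot_a", 2000, 10.0), ("plot_a", 2001, 8.0), ("plot_a", 2002, 6.0),
    ("plot_b", 2000, 100.0), ("plot_b", 2001, 110.0),
    ("plot_b", 2002, None), ("plot_b", 2003, 130.0),
]

# Group observed values by plot, skipping missing years.
series = defaultdict(list)
for plot, year, number in records:
    if number is not None:
        series[plot].append((year, number))

# One slope per plot. The very different levels of plot_a and plot_b
# (about 10 vs about 100) are deliberately not compared.
slopes = {plot: least_squares_slope(points) for plot, points in series.items()}
```

    Here plot_a declines by 2 units per year and plot_b rises by 10 units per year; the difference in their absolute levels carries no meaning under the warning above.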

  4. Bangladesh rape case dataset

    • dataverse.harvard.edu
    Updated Nov 9, 2024
    Cite
    Sajratul Yakin Rubaiat; Sheherjan Haq (2024). Bangladesh rape case dataset [Dataset]. http://doi.org/10.7910/DVN/OE7NFR
    Dataset updated
    Nov 9, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Sajratul Yakin Rubaiat; Sheherjan Haq
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    The "Bangladesh Rape Cases Data" dataset contains detailed information on rape cases reported in various districts of Bangladesh. This dataset is valuable for analyzing trends, patterns, and regional distributions of reported rape cases over a decade. It can be utilized by researchers, policymakers, and social scientists to study and address the issue of rape in Bangladesh.

    Total Sample Size: This dataset comprises a total of 2,813 rows, each representing an individual case reported in various districts across Bangladesh.

    Data Description:

    • headline (String): The headline of the news article reporting the rape case. It provides a brief summary of the incident.
    • district-tag (String): The district where the incident occurred. This helps in identifying the geographical distribution of the cases.
    • division-tag (String): The division of Bangladesh to which the district belongs. This is useful for broader regional analysis.
    • subdistrict-tag (String): The specific subdistrict or locality within the district where the incident occurred. This column may contain missing values if the subdistrict is not specified.
    • id (String, UUID format): A unique identifier for each news article, ensuring that each entry can be distinctly referenced.
    • url (String): The web link to the original news article, allowing users to access the full report for more detailed information.
    • last-published-at (DateTime): The date and time when the news article was last published, helping to understand the timeline of the reported cases.
    • offset (Integer): An offset value for the article, potentially indicating its position in a larger dataset or the order of processing.
    • content (String): The main content of the news article, providing detailed information about the incident.

    Temporal Coverage: Minimum date February 22, 2013; maximum date April 10, 2023. The dataset spans over a decade, allowing for a comprehensive temporal analysis of the reported cases.

    Potential Uses:

    • Trend Analysis: Analyze how the frequency of reported cases changes over time.
    • Geographical Analysis: Identify regions with higher or lower reporting rates.
    • Content Analysis: Examine the language and details provided in the headlines and content to understand the nature of reporting.
    • Correlation Studies: Investigate possible correlations between reported cases and other socio-economic factors.

    Data Quality and Considerations:

    • Missing Values: Some columns, such as subdistrict-tag, may contain missing values where specific information was not provided.
    • Data Source: The data is sourced from news articles, so it may be influenced by reporting biases and the availability of news coverage.
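
    The trend and geographical analyses listed above amount to counting rows per year and per district. The sketch below uses only the Python standard library; the column names 'last-published-at' and 'district-tag' come from the description, while the placeholder rows and the assumption that timestamps are ISO 8601 strings are not confirmed by the source.

```python
from collections import Counter
from datetime import datetime

# Invented placeholder rows; real rows would be read from the dataset file.
# Assumes 'last-published-at' is an ISO 8601 string, which may not hold.
rows = [
    {"district-tag": "Dhaka", "last-published-at": "2013-02-22T10:15:00"},
    {"district-tag": "Dhaka", "last-published-at": "2020-06-01T08:00:00"},
    {"district-tag": "Sylhet", "last-published-at": "2020-11-30T17:45:00"},
    {"district-tag": "Khulna", "last-published-at": "2023-04-10T09:30:00"},
]

# Trend analysis: number of reported articles per publication year.
reports_per_year = Counter(
    datetime.fromisoformat(row["last-published-at"]).year for row in rows
)

# Geographical analysis: number of reported articles per district.
reports_per_district = Counter(row["district-tag"] for row in rows)
```

    Because the rows are news articles, these counts reflect reporting volume and are subject to the reporting biases noted under Data Quality and Considerations.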

  5. A global database of long-term changes in insect assemblages

    • knb.ecoinformatics.org
    Updated Oct 1, 2020
    Cite
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2020). A global database of long-term changes in insect assemblages [Dataset]. http://doi.org/10.5063/F11V5C9V
    Dataset updated
    Oct 1, 2020
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
    Time period covered
    Jan 1, 1925 - Jan 1, 2018
    Area covered
    Pacific Ocean, North Pacific Ocean
    Variables measured
    End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 62 more
    Description

    This data set, under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 166 data sources, representing a total of 1676 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. All references to the original data sources can be found in the pdf with references, and a Google Earth (kml) file presents the locations (including metadata) of all datasets. When using (parts of) this data set, please respect the original open access licenses.

    This data set underlies all analyses performed in the paper 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes, and is accompanied by a data paper entitled 'InsectChange – a global database of temporal changes in insect and arachnid assemblages'. Consulting the data paper before use is recommended. Tables that can be used to calculate trends of specific taxa and for species richness will be added as they become available.

    The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

    In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found, and the table describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References_to_original_data_sources.pdf'.

    In the table 'PlotData', more details on each site within each dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and whether there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

    The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here, but more details are provided in the data paper. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

    The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

    Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses. Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

    WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).

  6. A global database of long-term changes in insect assemblages

    • data.nceas.ucsb.edu
    Updated Mar 31, 2020
    Cite
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann (2020). A global database of long-term changes in insect assemblages [Dataset]. https://data.nceas.ucsb.edu/view/urn%3Auuid%3A596c7d3c-f22b-489f-91a5-fddc03a1f481
    Dataset updated
    Mar 31, 2020
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Roel van Klink; Diana E. Bowler; Jonathan M. Chase; Orr Comay; Michael M. Driessen; S.K. Morgan Ernest; Alessandro Gentile; Francis Gilbert; Konstantin Gongalsky; Jennifer Owen; Guy Pe'er; Israel Pe'er; Vincent H. Resh; Ilia Rochlin; Sebastian Schuch; Ann E. Swengel; Scott R. Swengel; Thomas L. Valone; Rikjan Vermeulen; Tyson Wepprich; Jerome Wiedmann
    Time period covered
    Jan 1, 1925 - Jan 1, 2018
    Area covered
    Pacific Ocean, North Pacific Ocean
    Variables measured
    End, Link, Year, Realm, Start, CRUmnC, CRUmnK, Metric, Number, Period, and 62 more
    Description

    This data set, under a CC-BY license, contains time series of total abundance and/or biomass of insect, arachnid and Entognatha assemblages (grouped at the family level or higher taxonomic resolution), monitored by standardized means for ten or more years. The data were derived from 166 data sources representing a total of 1676 sites from 41 countries. The time series for abundance and biomass represent the aggregated number of all individuals of all taxa monitored at each site. The data set consists of four linked tables, representing information on the study level, the plot level, about sampling, and the measured assemblage sizes. An additional table presents all references to the data sources, and, if applicable, the open access license under which these are published. When using (parts of) this data set, please respect the original access licenses.

    This data set underlies all analyses performed in 'Meta-analysis reveals declines in terrestrial, but increases in freshwater insect abundances', a meta-analysis of changes in insect assemblage sizes. Tables for calculating trends of specific taxa and for species richness will be added as they become available.

    The data set consists of four tables that are linked by the columns 'DataSource_ID' and 'Plot_ID', and a table with references to original research.

    In the table 'DataSources', descriptive data is provided at the dataset level: links are provided to online repositories where the original data can be found, and the table describes whether the dataset provides data on biomass, abundance or both, the invertebrate group under study, the realm, and the location of sampling at different geographic scales (continent to state). This table also contains a reference column. The full reference to the original data is found in the file 'References'.

    In the table 'PlotData', more details on each site within a dataset are provided: there is data on the exact location of each plot, whether the plots were experimentally manipulated, and whether there was any spatial grouping of sites (column 'Location'). Additionally, this table contains all explanatory variables used for analysis, e.g. climate change variables, land-use variables, protection status.

    The table 'SampleData' describes the exact source of the data (table X, figure X, etc), the extraction methods, as well as the sampling methods (derived from the original publications). This includes the sampling method, sampling area, sample size, and how the aggregation of samples was done, if reported. Also, any calculations we did on the original data (e.g. reverse log transformations) are detailed here. This table links to the table 'DataSources' by the column 'DataSource_ID'. Note that each datasource may contain multiple entries in the 'SampleData' table if the data were presented in different figures or tables, or if there was any other necessity to split information on sampling details.

    The table 'InsectAbundanceBiomassData' provides the insect abundance or biomass numbers as analysed in the paper. It contains columns matching the tables 'DataSources' and 'PlotData', as well as year of sampling, a descriptor of the period within the year of sampling (this was used as a random effect), the unit in which the number is reported (abundance or biomass), and the estimated abundance or biomass. In the column for Number, missing data are included (NA). The years with missing data were added because this was essential for the analysis performed, and retained here because they are easier to remove than to add.

    Linking the table 'InsectAbundanceBiomassData.csv' with 'PlotData.csv' by column 'Plot_ID', and with 'DataSources.csv' by column 'DataSource_ID' will provide the full dataframe used for all analyses (except for column 'Stratum', which is derived from table 'SampleData'). Detailed explanations of all column headers and terms are available in the ReadMe file, and more details will be available in the forthcoming data paper.

    WARNING: Because of the disparate sampling methods and various spatial and temporal scales used to collect the original data, this dataset should never be used to test for differences in insect abundance/biomass among locations (i.e. differences in intercept). The data can only be used to study temporal trends, by testing for differences in slopes. The data are standardized within plots to allow the temporal comparison, but not necessarily among plots (even within one dataset).

  7. Data from: Multigenerational exposure to increased temperature reduces...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated May 3, 2022
    Cite
    Emma Moffett; David Fryxell; Kevin Simon (2022). Multigenerational exposure to increased temperature reduces metabolic rate but increases boldness in Gambusia affinis [Dataset]. http://doi.org/10.7280/D1MT39
    Available download formats: zip
    Dataset updated
    May 3, 2022
    Dataset provided by
    Dryad
    Authors
    Emma Moffett; David Fryxell; Kevin Simon
    Time period covered
    2022
    Description

    Males have no data in the pregnancy column

    Individuals that did not leave the refuge were not measured for activity and have no data in the "Time_spent_exploring_s" column. NAs in the dataset indicate missing data.

    Column descriptions/ units:

    Site refers to the location where Gambusia were collected

    Geothermal refers to the designation of sites as geothermal or ambient: geothermal sites received warm water inputs, whereas ambient sites experience daily and seasonal changes in environmental temperature.

    Source_Temp refers to the temperature of the site at the time of fish collection, in degrees Celsius.

    Temp_lab refers to the temperature at which fish were acclimated in the laboratory, in degrees Celsius.

    Run refers to the order in which metabolic rate was measured on fish

    Fish_ID is a unique identifier for each fish in this study.

    Sex refers to the sex of the fish measurements were done on (M = male, F = female).

    Pregnancy describes if the fish was visibly pregnant at the ti...

  8. Dissolved Inorganic Carbon and Dissolved Organic Carbon Data for the East...

    • osti.gov
    • knb.ecoinformatics.org
    • +2 more
    Updated Jan 1, 2024
    + more versions
    Cite
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States) (2024). Dissolved Inorganic Carbon and Dissolved Organic Carbon Data for the East River Watershed, Colorado (2015-2023) [Dataset]. http://doi.org/10.15485/1660459
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    Office of Science (http://www.er.doe.gov/)
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) (United States)
    Watershed Function SFA
    Description

    This data package contains mean values for dissolved organic carbon (DOC) and dissolved inorganic carbon (DIC) for water samples taken from the East River Watershed in Colorado. The East River is part of the Watershed Function Scientific Focus Area (WFSFA) located in the Upper Colorado River Basin, United States. DOC and DIC concentrations in water samples were determined using a TOC-VCPH analyzer (Shimadzu Corporation, Japan). DOC was analyzed as non-purgeable organic carbon (NPOC) by purging HCl-acidified samples with carbon-free air to remove DIC prior to measurement. After the acidified sample has been sparged, it is injected into a combustion tube filled with oxidation catalyst heated to 680 degrees C. The DOC in samples is combusted to CO2 and measured by a non-dispersive infrared (NDIR) detector. The peak area of the analog signal produced by the NDIR detector is proportional to the DOC concentration of the sample. DIC was determined by first acidifying the samples with HCl and then purging them with carbon-free air to release CO2 for analysis by the NDIR detector. All files are labeled by location and variable, and the data reported are the mean values of at least three replicate measurements with a relative standard deviation < 3%. All samples were analyzed under a rigorous quality assurance and quality control (QA/QC) process as detailed in the methods.
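The replicate criterion mentioned above (at least three replicates, relative standard deviation below 3%) can be illustrated with a short calculation. This is only a sketch under assumptions: the replicate values are invented, the function name is hypothetical, and the sample standard deviation is used because the description does not say which estimator was applied.

```python
from statistics import mean, stdev


def summarize_replicates(values):
    """Return (mean, relative standard deviation in percent) of replicates."""
    if len(values) < 3:
        raise ValueError("at least three replicate measurements are expected")
    avg = mean(values)
    # Sample standard deviation (n - 1 denominator); an assumption, see above.
    rsd_percent = 100.0 * stdev(values) / avg
    return avg, rsd_percent


replicates = [2.00, 2.02, 1.98]  # invented concentrations, arbitrary units
avg, rsd = summarize_replicates(replicates)
accepted = rsd < 3.0             # threshold quoted in the description
```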
This data package contains (1) a zip file (dic_npoc_data_2015-2023.zip) containing a total of 323 files: 322 data files of DIC and NPOC data from across the Lawrence Berkeley National Laboratory (LBNL) Watershed Function Scientific Focus Area (SFA) which is reported in .csv files per location and a locations.csv (1 file) with latitude and longitude for each location; (2) a file-level metadata (v4_20240311_flmd.csv) file that lists each file contained in the dataset with associated metadata; (3) a data dictionary (v4_20240311_dd.csv) file that contains terms/column_headers used throughout the files along with a definition, units, and data type; and (4) PDF and docx files for the determination of Method Detection Limits (MDLs) for DIC and NPOC data, which has been updated in 2024-03. Missing values within the anion data files are noted as either "-9999" or "0.0" for not detectable (N.D.) data. There are a total of 107 locations containing DIC/NPOC data.
Update on 2020-10-07: Updated the data files to remove times from the timestamps, so that only dates remain. The data values have not changed.
Update on 2021-04-11: Added Determination of Method Detection Limits (MDLs) for DIC, NPOC and TDN Analyses document, which can be accessed as a PDF or with Microsoft Word.
Update on 6/10/2022: versioned updates to this dataset were made along with these changes: (1) updated dissolved inorganic carbon and dissolved organic carbon data for all locations up to 2021-12-31, (2) removal of units from column headers in datafiles, (3) added row underneath headers to contain units of variables, (4) restructure of units to comply with CSV reporting format requirements, (5) added -9999 for empty numerical cells, and (6) the addition of the file-level metadata (flmd.csv) and data dictionary (dd.csv) were added to comply with the File-Level Metadata Reporting Format.
Update on 2022-09-09: Updates were made to reporting format specific files (file-level metadata and data dictionary) to correct swapped file names, add additional details on metadata descriptions on both files, add a header_row column to enable parsing, and add version number and date to file names (v2_20220909_flmd.csv and v2_20220909_dd.csv).
Update on 2023-08-08: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-01-05. The file level metadata and data dictionary files were updated to reflect the additional data added.
Update on 2024-03-11: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-11-21. Further, revisions to the data files were made to remove incorrect data points (from 1970 and 2001). The reporting format specific files were updated to reflect the additional data added. Revised versions of the PDF and docx files for determination of MDLs for DIC and NPOC were added to replace previous versions.

  9. Anion Data for the East River Watershed, Colorado (2014-2023)

    • dataone.org
    • osti.gov
    Updated May 23, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kenneth Williams; Curtis Beutler; Wendy Brown; Alexander Newman; Dylan O'Ryan; Roelof Versteeg (2024). Anion Data for the East River Watershed, Colorado (2014-2023) [Dataset]. https://dataone.org/datasets/ess-dive-7ebb5798646b4de-20240523T172944773
    Dataset updated
    May 23, 2024
    Dataset provided by
    ESS-DIVE
    Authors
    Kenneth Williams; Curtis Beutler; Wendy Brown; Alexander Newman; Dylan O'Ryan; Roelof Versteeg
    Time period covered
    May 2, 2014 - Sep 11, 2023
    Description

    The anion data for the East River Watershed, Colorado, consists of fluoride, chloride, sulfate, nitrate, and phosphate concentrations collected at multiple, long-term monitoring sites that include stream, groundwater, and spring sampling locations. These locations represent important and/or unique end-member locations for which solute concentrations can be diagnostic of the connection between terrestrial and aquatic systems. Such locations include drainages underlain entirely or largely by shale bedrock, land cover dominated by conifers, aspens, or meadows, and drainages impacted by historic mining activity and the presence of naturally mineralized rock. Developing a long-term record of solute concentrations from a diversity of environments is a critical component of quantifying the impacts of both climate change and discrete climate perturbations, such as drought, forest mortality, and wildfire, on the riverine export of multiple anionic species. Such data may be combined with stream gauging stations co-located at each monitoring site to directly quantify the seasonal and annual mass flux of these anionic species out of the watershed. This data package contains (1) a zip file (anion_data_2014-2023.zip) containing a total of 346 files: 345 data files of anion data from across the Lawrence Berkeley National Laboratory (LBNL) Watershed Function Scientific Focus Area (SFA) which is reported in .csv files per location and a locations.csv (1 file) with latitude and longitude for each location; (2) a file-level metadata (v5_20240311_flmd.csv) file that lists each file contained in the dataset with associated metadata; and (3) a data dictionary (v5_20240311_dd.csv) file that contains terms/column_headers used throughout the files along with a definition, units, and data type. Missing values within the anion data files are noted as either "-9999" or "0.0" for not detectable (N.D.) data. There are a total of 39 locations containing anion data.
Update on 2022-06-10: versioned updates to this dataset were made along with these changes: (1) updated anion data for all locations up to 2021-12-31, (2) removal of units from column headers in datafiles, (3) added row underneath headers to contain units of variables, (4) restructure of units to comply with CSV reporting format requirements, and (5) the addition of the file-level metadata (flmd.csv) and data dictionary (dd.csv) were added to comply with the File-Level Metadata Reporting Format.
Update on 2022-09-09: Updates were made to reporting format specific files (file-level metadata and data dictionary) to correct swapped file names, add additional details on metadata descriptions on both files, add a header_row column to enable parsing, and add version number and date to file names (v2_20220909_flmd.csv and v2_20220909_dd.csv).
Update on 2022-12-20: Updates were made to both the data files and reporting format specific files. Conversion issues affecting ER-PLM locations for anion data were resolved for the data files. Additionally, the flmd and dd files were updated to reflect the updated versions of these files. Available data was added up until 2022-03-14.
Update on 2023-08-08: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-05-19. The file level metadata and data dictionary files were updated to reflect the additional data added.
Update on 2024-03-11: Updates were made to both the data files and reporting format specific files. New available anion data was added, up until 2023-09-11. Further, revisions to the data files were made to remove incorrect data points (from 1970 and 2001). The reporting format specific files were updated to reflect the additional data added.
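The file layout described above (a units row directly beneath the header, "-9999" for missing values and "0.0" for not-detectable values) needs a little care when parsing. The following is a minimal sketch under stated assumptions: the column names 'date' and 'chloride', the unit strings and all values are invented, and the real files may carry additional lines whose position is given by the header_row column of the file-level metadata.

```python
import csv
import io

MISSING = "-9999"  # sentinel for missing values, as stated in the description

# Invented example file: header row, then a units row, then data rows.
example = (
    "date,chloride\n"
    "yyyy-mm-dd,micromolar\n"
    "2021-05-01,12.5\n"
    "2021-05-02,-9999\n"
    "2021-05-03,0.0\n"
)

reader = csv.reader(io.StringIO(example))
header = next(reader)                    # column names
units = dict(zip(header, next(reader)))  # units row sits right under the header

records = []
for raw in reader:
    row = dict(zip(header, raw))
    value = row["chloride"]
    if value == MISSING:
        row["chloride"], row["flag"] = None, "missing"
    elif float(value) == 0.0:
        # "0.0" marks not-detectable (N.D.) data rather than a true zero.
        row["chloride"], row["flag"] = 0.0, "not_detected"
    else:
        row["chloride"], row["flag"] = float(value), "ok"
    records.append(row)
```

Keeping a separate flag column is one possible design choice: it prevents the sentinel values from silently entering means or flux calculations.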

  10. Datasets of Suspended Sediment Concentration and Percent Fines (1973–2021),...

    • datasets.ai
    • data.usgs.gov
    • +1 more
    55
    Updated Dec 28, 2022
    + more versions
    Cite
    Department of the Interior (2022). Datasets of Suspended Sediment Concentration and Percent Fines (1973–2021), Sampling Information (1973–2021), and Daily Streamflow (1928–2021) for Sites in the Lower Mississippi and Atchafalaya Rivers to Support Analyses of Sediment Transport and Delivery [Dataset]. https://datasets.ai/datasets/datasets-of-suspended-sediment-concentration-and-percent-fines-19732021-sampling-informati
    55Available download formats
    Dataset updated
    Dec 28, 2022
    Dataset authored and provided by
    Department of the Interior
    Description

    Datasets of suspended sediment concentration and percent fines, sampling information, and daily streamflow data were compiled and harmonized for 16 sites to better understand sediment transport and delivery in the Lower Mississippi and Atchafalaya Rivers. The compiled data were harmonized by removing unnecessary columns, screening data for laboratory or sampling issues, creating consistent entries for character columns, and dropping irrelevant data, among other steps. Fourteen of the sites are in the Lower Mississippi-Atchafalaya River Basin with two additional sites on the Middle Mississippi and Ohio Rivers. Suspended sediment concentration (total, all size fractions) and percent fines for multiple size fractions were retrieved from the U.S. Geological Survey (USGS) National Water Information System (NWIS) database. These data were matched to related sampling information, such as sampler type and sampling method, also retrieved from NWIS. Continuous daily streamflow was compiled (or estimated where missing) for all sites and these data were from NWIS and the U.S. Army Corps of Engineers (USACE). Daily streamflow records extend as far back as possible and contain no gaps, whereas suspended sediment data and sampling information were measured and reported periodically and may contain multiyear gaps depending on the site. Note, siteIndex is used as the main sediment site identifier since sediment records from more than one USGS site are combined for at least one siteIndex. Additionally, gageIndex is used as the main streamgage identifier since streamflow records from multiple streamgages are sometimes combined for a single gageIndex and a single gageIndex may be used for more than one siteIndex. See the siteTable.csv for linkages between siteIndex, gageIndex, and USGS and USACE site/streamgage numbers.

  11. Higher Education Institutions in France Dataset

    • zenodo.org
    zip
    Updated Mar 3, 2025
    Cite
    Jackson Barreto; Rodrigo Costa (2025). Higher Education Institutions in France Dataset [Dataset]. http://doi.org/10.5281/zenodo.14960627
    Available download formats: zip
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jackson Barreto; Rodrigo Costa
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    France
    Description

    Higher Education Institutions in France Dataset

    This repository contains a dataset of 349 higher education institutions in France, including universities, universities of applied sciences, and higher institutes such as higher institutes of engineering, higher institutes of biotechnology, and a few others. The dataset was compiled in response to a cybersecurity investigation of French higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].

    Data

    The data includes the following fields for each institution:

    • ETER_Id: A unique identifier assigned to each institution.
    • Name: The full name of the institution.
    • Category: Indicates whether the institution is public or private.
    • Institution_Category_Standardized: Indicates whether the institution is University, University of applied sciences or other.
    • Member_of_European_University_alliance: Indicates whether the institution is a member of a European University Alliance (a kind of collaborative network of higher education institutions in Europe).
    • Url: The website of the institution.
    • NUTS2: Nomenclature of Territorial Units for Statistics (NUTS): A classification by the European Union to divide member states' territories into statistical units. The NUTS system has three hierarchical levels, with NUTS2 being the second level.
    • NUTS2_Label_2016: Refers to the classification of regions at the NUTS2 level according to the 2016 criteria set by the European Union.
    • NUTS2_Label_2021: Refers to the classification of regions at the NUTS2 level according to the 2021 criteria set by the European Union.
    • NUTS3: Nomenclature of Territorial Units for Statistics (NUTS): A classification by the European Union to divide member states' territories into statistical units. The NUTS system has three hierarchical levels, with NUTS3 being the third level.
    • NUTS3_Label_2016: Refers to the classification of regions at the NUTS3 level according to the 2016 criteria set by the European Union.
    • NUTS3_Label_2021: Refers to the classification of regions at the NUTS3 level according to the 2021 criteria set by the European Union.

    Methodology

    The dataset was created from two sources: the European Higher Education Sector Observatory (ETER) [3], from which the data was collected on December 26, 2024, and Eurostat, for the NUTS (Nomenclature of territorial units for statistics) classifications 2013-16 [4] and 2021 [5].

    This section outlines the methodology used to create the dataset for Higher Education Institutions (HEIs) in France. The dataset consolidates information from various sources, processes the data, and enriches it to provide accurate and reliable insights.

    Data Sources

    1. ETER Database: The primary dataset was sourced from the ETER database, containing detailed information about HEIs in Europe.
      • File: eter-export-2021-FR.xlsx
    2. Eurostat NUTS Data: Two datasets from Eurostat were used for regional information:
      • NUTS 2013-2016 regions: NUTS2013-NUTS2016.xlsx
      • NUTS 2021 regions: NUTS2021.xlsx

    Data Cleaning and Preprocessing

    Column Renaming

    Columns in the raw dataset were renamed for consistency and readability. Examples include:

    • ETER ID → ETER_ID
    • Institution Name → Name
    • Legal status → Category

    Value Replacement

    1. HEI Categories: The Category column was cleaned, with government-dependent institutions classified as "public."
    2. Standardized Institution Categories: Mapped numerical values to descriptive labels such as "University" and "University of applied sciences."
    3. European University Alliance Membership: Replaced binary values with "Yes" or "No."

    Handling Missing or Incorrect Data

    1. Specific entries with missing or incorrect data were updated manually based on their ETER_ID. For instance:
      • Adjusted URLs for entries like FR0333 (updated to www.icam.fr)
      • Adjusted URLs for entries like FR0906 (updated to epss.fr)
      • Adjusted URLs for entries like FR0104 (updated to www.ensa-nancy.fr)
      • Adjusted URLs for entries like FR0466 (updated to www.clermont-auvergne-inp.fr)
      • Adjusted URLs for entries like FR0907 (updated to insp.gouv.fr); this institution has also been renamed Institut national du service public
      • Removed entries such as FR0129 and FR0944 due to insufficient or invalid information.
      • Removed FR0513 Institut supérieur européen de gestion Lyon because it shares its URL with the Paris school, so only the main campus in Paris remains
      • Removed FR0235 Institut supérieur de l'électronique et du numérique Toulon because it shares its URL with Institut supérieur de l'électronique et du numérique Lille, so only the main campus remains
      • Removed FR0106 and FR010 École spéciale militaire because the URL returns 403 Forbidden
      • Removed FR0970 École nationale de la meteorologie because of invalid HTTPS

    Regional Data Integration

    1. Merged NUTS 2016 and NUTS 2021 data to enrich the dataset with regional labels.

    Final Dataset

    The final dataset was saved as a CSV file, france-heis.csv, encoded in UTF-8 for compatibility. It includes detailed information about HEIs in France, their categories, regional affiliations, and membership in European alliances.
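The cleaning steps listed in the methodology (renaming, value replacement, manual fixes, removals and the regional merge) could look roughly like the sketch below. It is an illustration under assumptions, not the authors' script: the raw rows, the 'Alliance' column, the raw category string "private government-dependent", the binary coding "1"/"0" and the NUTS code 'FRX1' are all invented, while the rename pairs, URL fixes and removed IDs are taken from the examples in the text.

```python
# Mappings taken from the examples given in the methodology.
RENAMES = {"ETER ID": "ETER_ID", "Institution Name": "Name", "Legal status": "Category"}
URL_FIXES = {"FR0333": "www.icam.fr", "FR0906": "epss.fr"}
REMOVED_IDS = {"FR0129", "FR0944"}

# Invented raw rows and region labels, for illustration only.
raw_rows = [
    {"ETER ID": "FR0333", "Institution Name": "Example institute",
     "Legal status": "private government-dependent", "Alliance": "0",
     "Url": "old.example.org", "NUTS2": "FRX1"},
    {"ETER ID": "FR0129", "Institution Name": "Removed entry",
     "Legal status": "public", "Alliance": "1", "Url": "", "NUTS2": "FRX2"},
]
nuts2_labels_2021 = {"FRX1": "Example region"}


def clean(rows, nuts2_labels):
    """Apply renaming, value replacement, manual fixes and the NUTS merge."""
    cleaned = []
    for row in rows:
        rec = {RENAMES.get(key, key): value for key, value in row.items()}
        if rec["ETER_ID"] in REMOVED_IDS:
            continue  # entries dropped for insufficient or invalid information
        if "government-dependent" in rec["Category"]:
            rec["Category"] = "public"
        member = rec.pop("Alliance") == "1"
        rec["Member_of_European_University_alliance"] = "Yes" if member else "No"
        rec["Url"] = URL_FIXES.get(rec["ETER_ID"], rec["Url"])
        rec["NUTS2_Label_2021"] = nuts2_labels.get(rec["NUTS2"], "")
        cleaned.append(rec)
    return cleaned


heis = clean(raw_rows, nuts2_labels_2021)
```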

    Summary

    This methodology ensures that the dataset is accurate, consistent, and enriched with valuable regional and institutional details. The final dataset is intended to serve as a reliable resource for analyzing French HEIs.

    Usage

    This data is available under the Creative Commons Zero (CC0) license and can be used for any purpose, including academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].

    If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI: 10.5281/zenodo.7614862

    Contribution

    If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.

    Acknowledgment

    We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Instituto Politécnico de Viana do Castelo, Portugal.

    References

    1. Pending
    2. S. Bezjak, A. Clyburne-Sherin, P. Conzett, P. Fernandes, E. Görögh, K. Helbig, B. Kramer, I. Labastida, K. Niemeyer, F. Psomopoulos, T. Ross-Hellauer, R. Schneider, J. Tennant, E. Verbakel, H. Brinken, and L. Heller, Open Science Training Handbook. Zenodo, Apr. 2018. [Online]. Available: [https://doi.org/10.5281/zenodo.1212496]
    3. The European Higher Education Sector Observatory, Dec 2024. Available: ETER
    4. NUTS - Nomenclature of territorial units for statistics, Dec 2024. Available: NUTS-2013-2016
    5. NUTS - Nomenclature of territorial units for statistics, Dec 2024. Available: NUTS-2021.

  12. De_Novo_Drug_Design.Bootstrap

    • huggingface.co
    Updated Mar 15, 2025
    Cite
    jeff (2025). De_Novo_Drug_Design.Bootstrap [Dataset]. https://huggingface.co/datasets/AICanada/De_Novo_Drug_Design.Bootstrap
    Dataset updated
    Mar 15, 2025
    Authors
    jeff
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Details of AF3D_pLDDT_PChem3D_Shapes_Energy_Binding_2025_03_14 data: De novo drug design: energy minimization, also known as geometry optimization, is used to generate new drug molecules from scratch based on the 3D structure of the target receptor. 9329 rows of data are provided out of a total of 14109 rows; columns were filtered to remove null values. Duplicate "PDB_protein_key" mapped to same duplicate "AF 3D AtomicData"(along with all data in rows) due to the many to many… See the full description on the dataset page: https://huggingface.co/datasets/AICanada/De_Novo_Drug_Design.Bootstrap.

  13. Arcangelo Corelli - Sonate a tre (A corpus of annotated scores)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 8, 2023
    Cite
    Johannes Hentschel (2023). Arcangelo Corelli - Sonate a tre (A corpus of annotated scores) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7504011
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Johannes Hentschel
    Fabian C. Moss
    Markus Neuwirth
    Martin Rohrmeier
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Arcangelo Corelli - Trio Sonatas (A corpus of annotated scores)

    This corpus of annotated MuseScore files has been created within the DCML corpus initiative and employs the DCML harmony annotation standard. It was released together with, and as part of, the "workflow paper":

    Hentschel, J., Moss, F. C., Neuwirth, M., & Rohrmeier, M. A. (2021). A semi-automated workflow paradigm for the distributed creation and curation of expert annotations. Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR, 262–269. https://doi.org/10.5281/ZENODO.5624417

    The corpus comprises 36 Sonate a tre, divided into 149 separate movements. Together they make up three of the four famous cycles of 12 trio sonatas each:

        Opus  Cycle                Publication  Included
        1     12 sonate da chiesa  Rome 1681    Yes
        2     12 sonate da camera  Rome 1685    No
        3     12 sonate da chiesa  Rome 1689    Yes
        4     12 sonate da camera  Rome 1694    Yes

    Versions

    Version 2.0

    TSV files now come with the column quarterbeats, which measures in quarter notes each event's position as its distance from the beginning

    Extracted notes now come with the columns name and octave.

    Column volta (containing first and second endings) removed from pieces that don't have any.

    metadata.tsv has been enriched with further columns, in particular information about each movement's dimensions, including dimensions upon unfolding repeats (for instance, last_mn has the number of measures, last_mn_unfolded the number of measures when playing all repeats)

    The folder reviewed contains two files per movement:

    A copy of the score where all out-of-label notes have been colored in red; changes to the annotations are shown in these files in a diff-like manner (removed in red, added in green).

    A copy of the harmonies TSV with six added columns that reflect the coloring of out-of-label notes ("coloring reports")

    As long as the ms3 review has any complaints, it stores them in the file warnings.log. Currently, it is showing those labels where over 60% of the notes in the segment have been colored in red and probably need revisiting (pull requests welcome).

    TSV files are automatically kept up to date using the new GitHub action dcml_corpus_workflow which is the successor of the implementation used in the creation of this dataset.

    Version 1.1

    This release marks the moment where all 149 movements include a reviewed set of annotations that adhere to version 2.3.0 of the DCML harmony annotation standard. The metadata have not been completed yet and the data were extracted one last time with the now deprecated version 0.4.11 of the MuseScore parser ms3 for matters of completeness and homogeneity. The purpose is mainly to substantiate the claim that the "semi-automated workflow paradigm", as it had been implemented at publication time (see the ISMIR paper cited above), can indeed be put to effective use in the creation of a large dataset. This version is, however, to be followed by a version with upgraded tabular data based on the more mature ms3 > 1.0.0.

    Version 1.0

    The first release reflects the state of the dataset when finalizing chapter 4 of the workflow paper cited above.

    Getting the data

    With full version history

    The dataset is version-controlled via git. In order to download the files with all revisions they have gone through, git needs to be installed on your machine. Then you can clone this repository using the command

    git clone https://github.com/DCMLab/corelli.git

    Without full version history

    If you are only interested in the current version of the corpus, you can simply download and unpack this ZIP file.

    Data Formats

    Each piece in this corpus is represented by four files with identical names, each in its own folder. For example, the first movement of the first sonata has the following files:

    MS3/op01n01a.mscx: Uncompressed MuseScore file including the music and annotation labels.

    notes/op01n01a.tsv: A table of all note heads contained in the score and their relevant features (not each of them represents an onset, some are tied together)

    measures/op01n01a.tsv: A table with relevant information about the measures in the score.

    harmonies/op01n01a.tsv: A list of the included harmony labels (including cadences and phrases) with their positions in the score.

    Opening Scores

    After navigating to your local copy, you can open the scores in the folder MS3 with the free and open source score editor MuseScore. Please note that the scores have been edited, annotated and tested with MuseScore 3.6.2. MuseScore 4 has since been released and preliminary tests suggest that it renders them correctly.

    Opening TSV files in a spreadsheet

    Tab-separated value (TSV) files are like Comma-separated value (CSV) files and can be opened with most modern text editors. However, for correctly displaying the columns, you might want to use a spreadsheet or an addon for your favourite text editor. When you use a spreadsheet such as Excel, it might annoy you by interpreting fractions as dates. This can be circumvented by using Data --> From Text/CSV or the free alternative LibreOffice Calc. Other than that, TSV data can be loaded with every modern programming language.

    Loading TSV files in Python

    Since the TSV files contain null values, lists, fractions, and numbers that are to be treated as strings, you may want to use this code to load any TSV files related to this repository (provided you're doing it in Python). After a quick pip install -U ms3 (requires Python 3.10) you'll be able to load any TSV like this:

    import ms3

    labels = ms3.load_tsv('harmonies/op01n01a.tsv')
    notes = ms3.load_tsv('notes/op01n01a.tsv')
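For readers who cannot install ms3, the conversion concerns mentioned above (null values, fractions) can be approximated with the standard library alone. This is a rough sketch, not a replacement for ms3.load_tsv: the TSV content is invented, the column names 'mc' and 'label' are assumptions (only 'mc_onset' is named elsewhere in this description), and only these three columns are converted.

```python
import csv
import io
from fractions import Fraction

# Invented TSV content; a real file would be opened with open(path, newline="").
example_tsv = "mc\tmc_onset\tlabel\n1\t0\tI\n1\t1/2\tV\n2\t3/4\t\n"

rows = []
for raw in csv.DictReader(io.StringIO(example_tsv), delimiter="\t"):
    rows.append({
        "mc": int(raw["mc"]),
        # Keep onsets exact: Fraction parses strings such as "1/2" or "0".
        "mc_onset": Fraction(raw["mc_onset"]),
        # Empty cells become None instead of empty strings.
        "label": raw["label"] or None,
    })
```

Parsing onsets as Fraction rather than float avoids rounding errors when positions are later added up.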

    Column names

    You can look up meaning and data type of the columns of all TSV files including metadata.tsv in ms3's documentation (simply search through the page).

    Generating all TSV files from the scores

    When you have made changes to the scores and want to update the TSV files accordingly, you can use the following command (provided you have pip-installed ms3):

    ms3 extract -M -N -X -D # for measures, notes, expanded annotations, and metadata

    If, in addition, you want to generate the reviewed scores with out-of-label notes colored in red, you can do

    ms3 review -M -N -X -D # for extracting measures, notes, expanded annotations, and metadata

    By adding the flag -c to the review command, it will additionally compare the (potentially modified) annotations in the score with the ones currently present in the harmonies TSV files and reflect the comparison in the reviewed scores.

    Score origin

    To create the dataset we downloaded the musicXML conversion available on Craig Sapp's KernScores (thanks to the engraver(s) who first encoded the scores in **kern format), converted them to MuseScore, and had them corrected and completed by the transcription service tunescribers.com. This involved adding thorough bass figures throughout and engraving a few missing movements from scratch. The commission was performed based on the Pepusch prints available on the International Music Score Library Project (IMSLP) which are included in the folder pdf:

    Opus  File                                         IMSLP
    1     Corelli op. 1 12 Triosonaten - Partitur.pdf  https://imslp.org/wiki/Special:ReverseLookup/1666
    3     Corelli op. 3 12 Triosonaten - Partitur.pdf  https://imslp.org/wiki/Special:ReverseLookup/1689
    4     Corelli op. 4 12 Triosonaten - Partitur.pdf  https://imslp.org/wiki/Special:ReverseLookup/1690

    (The scan of op. 3 is missing page 46, corresponding to op03n12a)

    Whenever pitches, bass figures or their placement were obviously wrong they have been corrected based on the Rome princeps editions.

    Caveats

    Wrong positions

    Two files have different time signatures in the upper and lower staff pairs which leads to wrong positions:

    op03n10d has 12/8 vs. 2/2

    op04n06g has 12/8 vs. 4/4

    Since the parser deals only with one time signature per measure, and since positions are computed additively, the positions are currently incorrect for all events in these two pieces which occur in staff 3 or 4 after beat 1.

    As a remedy, staves 1 and 2 could be rewritten in simple meters (2/2 or 4/4) using triplets. For now, users can multiply the mc_onset values for staves 3 and 4 by 1.5 as a workaround. The quarterbeats then need to be recomputed by adding the stretched onset values to the MC's quarterbeat.
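
    The workaround can be sketched in plain Python as follows. This is only an illustration under assumptions: it supposes that each note row carries the keys mc, staff and mc_onset, that mc_onset is a fraction of a whole note, and that the quarterbeat position of each MC's downbeat is available in a lookup (here the hypothetical dict mc_quarterbeats, which could be derived from the measures table). The function name stretch_onsets is made up for this example and is not part of ms3.

```python
from fractions import Fraction

# A 12/8 measure lasts 3/2 whole notes, a 2/2 or 4/4 measure lasts 1 whole note.
STRETCH = Fraction(3, 2)

def stretch_onsets(notes, mc_quarterbeats, staves=(3, 4)):
    """Return copies of the note rows with corrected mc_onset and quarterbeats.

    notes: iterable of dicts with keys 'mc', 'staff', 'mc_onset'
           (assumption: mc_onset is a fraction of a whole note)
    mc_quarterbeats: dict mapping each MC to the quarterbeat of its downbeat
                     (hypothetical helper structure, not an ms3 object)
    """
    fixed = []
    for row in notes:
        row = dict(row)  # do not modify the caller's data
        onset = Fraction(row["mc_onset"])
        if row["staff"] in staves:
            onset *= STRETCH  # stretch only the staves with wrong positions
        row["mc_onset"] = onset
        # one whole note equals four quarterbeats
        row["quarterbeats"] = mc_quarterbeats[row["mc"]] + onset * 4
        fixed.append(row)
    return fixed

example_notes = [
    {"mc": 2, "staff": 3, "mc_onset": "1/4"},
    {"mc": 2, "staff": 1, "mc_onset": "1/4"},
]
example_offsets = {2: Fraction(6)}
corrected = stretch_onsets(example_notes, example_offsets)
```

    In this made-up example the staff 3 event moves from onset 1/4 to 3/8, while the staff 1 event keeps its onset.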

    warnings.log

    As long as this file exists, the ms3 review command has detected one or more of the following:

    incongruent phrase beginnings { and endings },

    harmony labels where over 60 % of the note heads in the segment are out-of-label,

    other warnings.

    Pull requests addressing any of these warnings would be highly appreciated.

    Instruments

    The information on the four parts in the MuseScore files has not been curated. That concerns the staff names, brackets, behaviour of barlines, and instruments. If someone could send us a good configuration that looks and sounds decent, we would be glad to automatically apply it to the entire dataset.

    License

    Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0).

    Naming convention

    For example, all files starting with op03n02 are movements of Sonata number 2 from opus 3. The sequence of movements is indicated by appended letters op03n02a, op03n02b, etc.
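
    If file names need to be grouped by opus or sonata in a script, the convention can be parsed with a regular expression. The following is a minimal sketch that assumes two-digit opus and sonata numbers and a single lowercase movement letter, as in the examples above. The function name parse_piece_name is hypothetical.

```python
import re

# op<two digits>n<two digits><movement letter>, e.g. op03n02a
PIECE_NAME = re.compile(r"^op(?P<opus>\d{2})n(?P<sonata>\d{2})(?P<movement>[a-z])")

def parse_piece_name(name):
    """Split a name such as 'op03n02a' into (opus, sonata, movement letter)."""
    match = PIECE_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the naming convention")
    return int(match["opus"]), int(match["sonata"]), match["movement"]

parsed = parse_piece_name("op03n02a")
```

    Because the pattern is only anchored at the start, a file name with an extension such as op04n06g.tsv is parsed the same way.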

    Questions, Suggestions, Corrections, Bug Reports

    For questions, remarks etc., please create an issue and feel free to fork and submit pull requests.


Rushikesh Konapure (2022). Black Friday Sales EDA [Dataset]. https://www.kaggle.com/datasets/rishikeshkonapure/black-friday-sales-eda

Black Friday Sales EDA

Data Analytics Project

Dataset updated: Oct 29, 2022
Dataset provided by: Kaggle (http://kaggle.com/)
Authors: Rushikesh Konapure
License: https://creativecommons.org/publicdomain/zero/1.0/

Description

Dataset History

A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.

Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

Tasks to perform

The Purchase column is the target variable. Perform univariate and bivariate analysis with respect to Purchase.

"Masked" in the column description means that the column has already been converted from categorical to numerical values.

The points below are only meant to get you started with the dataset; you do not have to follow them in the same sequence.

DATA PREPROCESSING

  • Check the basic statistics of the dataset

  • Check for missing values in the data

  • Check for unique values in data

  • Perform EDA

  • Purchase Distribution

  • Check for outliers

  • Analysis by gender, marital status, occupation, occupation vs purchase, purchase by city, purchase by age group, etc.

  • Drop unnecessary fields

  • Convert categorical data into integers using the map function (e.g. the 'Gender' column)

  • Missing value treatment

  • Rename columns

  • Fill nan values

  • Map range variables into integers (e.g. the 'Age' column)
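
The two mapping steps above (categorical values to integers, range values to integers) can be sketched with plain dictionaries. The category labels and integer codes below are assumptions for illustration, as are the sample rows; check them against the unique values in your copy of the data before using them.

```python
# Hypothetical codes: verify the labels against the unique values in the data.
GENDER_CODES = {"F": 0, "M": 1}
AGE_CODES = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
             "46-50": 4, "51-55": 5, "55+": 6}

def encode_rows(rows):
    """Return copies of the rows with 'Gender' and 'Age' replaced by integer codes.

    Labels missing from a mapping become None, so they can be dealt with
    in the missing-value step instead of raising an error here.
    """
    encoded = []
    for row in rows:
        row = dict(row)  # leave the input rows untouched
        row["Gender"] = GENDER_CODES.get(row["Gender"])
        row["Age"] = AGE_CODES.get(row["Age"])
        encoded.append(row)
    return encoded

# Made-up sample rows, not taken from the dataset.
sample = [
    {"Gender": "F", "Age": "0-17", "Purchase": 100},
    {"Gender": "M", "Age": "55+", "Purchase": 200},
]
encoded_sample = encode_rows(sample)
```

If the data is held in a pandas DataFrame named df (an assumed name), the same dictionaries can be passed to Series.map, for example df["Gender"].map(GENDER_CODES). Labels absent from the dictionary then become NaN.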

Data Visualisation

  • Visualize individual columns
  • Age vs Purchased
  • Occupation vs Purchased
  • Productcategory1 vs Purchased
  • Productcategory2 vs Purchased
  • Productcategory3 vs Purchased
  • City category pie chart
  • Check for other possible plots

All the Best!!
