65 datasets found
  1. Parkinson_csv

    • kaggle.com
    zip
    Updated Sep 12, 2021
    Cite
    Sagar Bapodara (2021). Parkinson_csv [Dataset]. https://www.kaggle.com/datasets/sagarbapodara/parkinson-csv/discussion
    Explore at:
    zip (15986 bytes)
    Dataset updated
    Sep 12, 2021
    Authors
    Sagar Bapodara
    Description

    Dataset

    This dataset is the CSV version of the original Parkinson's dataset found at https://www.kaggle.com/nidaguler/parkinsons-data-set

    Content

    Title: Parkinson's Disease Data Set

    Abstract: Oxford Parkinson's Disease Detection Dataset

    Source

    The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

    Dataset Info

    This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.

    The data is in ASCII CSV format. Each row of the CSV file contains an instance corresponding to one voice recording, and there are around six recordings per patient; the patient is identified in the first column. For further information or to pass on comments, please contact Max Little (littlem '@' robots.ox.ac.uk).

    Further details are contained in the following reference -- if you use this dataset, please cite: Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2008), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering (to appear).

    Attribute Information:

    Matrix column entries (attributes):

    • name - ASCII subject name and recording number
    • MDVP:Fo(Hz) - Average vocal fundamental frequency
    • MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
    • MDVP:Flo(Hz) - Minimum vocal fundamental frequency
    • MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
    • MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
    • NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
    • status - Health status of the subject: one for Parkinson's, zero for healthy
    • RPDE, D2 - Two nonlinear dynamical complexity measures
    • DFA - Signal fractal scaling exponent
    • spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation
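    Since the task is binary classification on the "status" column, a minimal sketch of how the CSV could be loaded and modelled is shown below (the local file name and the scikit-learn pipeline are illustrative assumptions, not part of the dataset):

```python
# Sketch: load the Parkinson's CSV and fit a simple baseline classifier.
# Assumes the file has been downloaded locally as "parkinsons.csv".
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("parkinsons.csv")

X = df.drop(columns=["name", "status"])  # voice measures only
y = df["status"]                         # 1 = Parkinson's, 0 = healthy

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

    Note that this splits by recording rather than by patient; since there are about six recordings per subject, a rigorous evaluation would group the split by the "name" column.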

    Citation Request:

    If you use this dataset, please cite the following paper: 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007)

  2. ENTSO-E Hydropower modelling data (PECD) in CSV format

    • zenodo.org
    csv
    Updated Aug 14, 2020
    Cite
    Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3949757
    Explore at:
    csv
    Dataset updated
    Aug 14, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matteo De Felice
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PECD Hydro modelling

    This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

    The original URLs:

    The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019

    As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, I want to share my data-wrangling efforts with this repository to make this dataset more accessible.

    Data description

    The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

    In this repository you can find the following CSV files (a short loading sketch follows the list):

    • PECD-hydro-capacities.csv: installed capacities
    • PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping
    • PECD-hydro-daily-ror-generation.csv: daily run-of-river generation
    • PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation
    • PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels
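    The files are plain CSV tables, so they can be inspected directly; a minimal sketch is shown below (the file names are the ones listed above, everything else, including the working directory, is an assumption):

```python
# Sketch: load the PECD hydro CSV files and take a quick look at their structure.
import pandas as pd

files = [
    "PECD-hydro-capacities.csv",
    "PECD-hydro-weekly-inflows.csv",
    "PECD-hydro-daily-ror-generation.csv",
    "PECD-hydro-weekly-reservoir-min-max-generation.csv",
    "PECD-hydro-weekly-reservoir-min-max-levels.csv",
]

for name in files:
    df = pd.read_csv(name)
    # Print shape and column names to see how each table is organized.
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    print("  columns:", list(df.columns))
```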

    Capacities

    The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5
    • sheet Reservoir, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3
    • sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

    Inflows

    The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 16 to 51
    • sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51

    Daily run-of-river

    The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

    Minimum and maximum reservoir generation

    The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 13 to 66, columns from 196 to 231
    • sheet Reservoir, rows from 13 to 66, columns from 232 to 267

    Minimum/Maximum reservoir levels

    The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

    • sheet Reservoir, rows from 14 to 66, column 12
    • sheet Reservoir, rows from 14 to 66, column 13

    CHANGELOG

    [2020/07/17] Added maximum generation for the reservoir

  3. 2022 Bikeshare Data -Reduced File Size -All Months

    • kaggle.com
    zip
    Updated Mar 8, 2023
    Cite
    Kendall Marie (2023). 2022 Bikeshare Data -Reduced File Size -All Months [Dataset]. https://www.kaggle.com/datasets/kendallmarie/2022-bikeshare-data-all-months-combined
    Explore at:
    zip (98884 bytes)
    Dataset updated
    Mar 8, 2023
    Authors
    Kendall Marie
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This is a condensed version of the raw data obtained through the Google Data Analytics Course, made available by Lyft and the City of Chicago under this license (https://ride.divvybikes.com/data-license-agreement).

    I originally did my study in another platform, and the original files were too large to upload to Posit Cloud in full. Each of the 12 monthly files contained anywhere from 100k to 800k rows. Therefore, I decided to reduce the number of rows drastically by performing grouping, summaries, and thoughtful omissions in Excel for each csv file. What I have uploaded here is the result of that process.

    Data is grouped by month, day, rider_type, bike_type, and time_of_day. total_rides represents the number of original rows (rides) that were combined to make each new summarized row, and avg_ride_length is the calculated average of all rides in each grouping.

    Be sure to use weighted averages if you want to calculate the mean of avg_ride_length for different subgroups, as the values in this file are already averages of the summarized groups. You can include the total_rides value in your weighted average calculation to weight the groups properly, as in the sketch below.
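    A minimal sketch of such a weighted mean by rider_type (the local file name is an assumption; the column names are the ones documented here):

```python
# Sketch: recover the overall mean ride length per rider_type from the summarized file,
# weighting each row's avg_ride_length by its total_rides count.
import numpy as np
import pandas as pd

df = pd.read_csv("2022_bikeshare_summary.csv")  # assumed local file name

weighted = (
    df.groupby("rider_type")
      .apply(lambda g: np.average(g["avg_ride_length"], weights=g["total_rides"]))
)
print(weighted)
```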

    9 Columns:

    • date - year, month, and day in date format; includes all days in 2022.
    • day_of_week - actual day of the week as a character value; set up a new sort order if needed.
    • rider_type - either 'casual' (riders who pay per ride) or 'member' (riders with annual memberships).
    • bike_type - either 'classic' (non-electric, traditional bikes) or 'electric' (e-bikes).
    • time_of_day - divides the day into six equal 4-hour time frames starting at 12AM. Each individual ride was placed into a time frame using the time it STARTED, even if the ride was long enough to end in a later time frame. Added to help summarize the original dataset.
    • total_rides - count of all individual rides in each grouping (row). Added to help summarize the original dataset.
    • avg_ride_length - the calculated average of all rides in each grouping (row); look to total_rides to know how many original ride-length values were included in this average. Added to help summarize the original dataset.
    • min_ride_length - minimum ride length of all rides in each grouping (row). Added to help summarize the original dataset.
    • max_ride_length - maximum ride length of all rides in each grouping (row). Added to help summarize the original dataset.

    Please note: the time_of_day column has inconsistent spacing. Use mutate(time_of_day = gsub(" ", "", time_of_day)) to remove all spaces.

    Revisions

    Below is the list of revisions I made in Excel before uploading the final csv files to the R environment:

    • Deleted station location columns and lat/long as much of this data was already missing.

    • Deleted ride id column since each observation was unique and I would not be joining with another table on this variable.

    • Deleted rows pertaining to "docked bikes" since there were no member entries for this type and I could not compare member vs casual rider data. I also received no information in the project details about what constitutes a "docked" bike.

    • Used ride start time and end time to calculate a new column called ride_length (by subtracting), and deleted all rows with 0 and 1 minute results, which were explained in the project outline as being related to staff tasks rather than users. An example would be taking a bike out of rotation for maintenance.

    • Placed start time into a range of times (time_of_day) in order to group more observations while maintaining general time data. time_of_day now represents a time frame when the bike ride BEGAN. I created six 4-hour time frames, beginning at 12AM.

    • Added a Day of Week column, with Sunday = 1 and Saturday = 7, then changed from numbers to the actual day names.

    • Used pivot tables to group total_rides, avg_ride_length, min_ride_length, and max_ride_length by date, rider_type, bike_type, and time_of_day.

    • Combined into one csv file with all months, containing less than 9,000 rows (instead of several million)

  4. Types, open citations, closed citations, publishers, and participation reports of Crossref entities

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Jan 24, 2020
    Cite
    Hiebi, Ivan; Peroni, Silvio; Shotton, David (2020). Types, open citations, closed citations, publishers, and participation reports of Crossref entities [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_2558257
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom
    Digital Humanities Advanced Research Centre, Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
    Authors
    Hiebi, Ivan; Peroni, Silvio; Shotton, David
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This publication contains several datasets that have been used in the paper "Crowdsourcing open citations with CROCI – An analysis of the current status of open citations, and a proposal" submitted to the 17th International Conference on Scientometrics and Bibliometrics (ISSI 2019), available at https://opencitations.wordpress.com/2019/02/07/crowdsourcing-open-citations-with-croci/.

    Additional information about the analyses described in the paper, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb. The datasets contain the following information.

    non_open.zip: it is a zipped (~5 GB unzipped) CSV file containing the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, dated October 2018. All the entity types retrieved from Crossref were aligned to one of following five categories: journal, book, proceedings, dataset, other. The open CC0 citation data we used came from the CSV dump of most recent release of COCI dated 12 November 2018. The number of closed citations was calculated by subtracting the number of open citations to each entity available within COCI from the value “is-referenced-by-count” available in the Crossref metadata for that particular cited entity, which reports all the DOI-to-DOI citation links that point to the cited entity from within the whole Crossref database (including those present in the Crossref ‘closed’ dataset).

    The columns of the CSV file are the following ones:

    doi: the DOI of the publication in Crossref;

    type: the type of the publication as indicated in Crossref;

    cited_by: the number of open citations received by the publication according to COCI;

    non_open: the number of closed citations received by the publication according to Crossref + COCI.
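    A minimal sketch of how these columns could be aggregated (assuming the zipped CSV has been extracted locally as non_open.csv; the extracted file name is an assumption):

```python
# Sketch: share of open vs. closed citations per publication type,
# using the doi/type/cited_by/non_open columns described above.
# The unzipped file is ~5 GB, so chunked reading may be needed in practice.
import pandas as pd

df = pd.read_csv("non_open.csv")  # extracted from non_open.zip (assumed name)

per_type = df.groupby("type")[["cited_by", "non_open"]].sum()
per_type["open_share"] = per_type["cited_by"] / (per_type["cited_by"] + per_type["non_open"])
print(per_type.sort_values("open_share", ascending=False))
```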

    croci_types.csv: it is a CSV file that contains the numbers of open citations and closed citations received by the entities in the Crossref dump used in our computation, as collected in the previous CSV file, aligned in five classes depending on the entity types retrieved from Crossref: journal (Crossref types: journal-article, journal-issue, journal-volume, journal), book (Crossref types: book, book-chapter, book-section, monograph, book track, book-part, book-set, reference-book, dissertation, book series, edited book), proceedings (Crossref types: proceedings-article, proceedings, proceedings-series), dataset (Crossref types: dataset), other (Crossref types: other, report, peer review, reference-entry, component, report-series, standard, posted-content, standard-series).

    The columns of the CSV file are the following ones:

    type: the type publication between "journal", "book", "proceedings", "dataset", "other";

    label: the label assigned to the type for visualisation purposes;

    coci_open_cit: the number of open citations received by the publication type according to COCI;

    crossref_close_cit: the number of closed citations received by the publication according to Crossref + COCI.

    publishers_cits.csv: it is a CSV file that contains the top twenty publishers that received the greatest number of open citations. The columns of the CSV file are the following ones:

    publisher: the name of the publisher;

    doi_prefix: the list of DOI prefixes used by the publisher;

    coci_open_cit: the number of open citations received by the publications of the publisher according to COCI;

    crossref_close_cit: the number of closed citations received by the publications of the publishers according to Crossref + COCI;

    total_cit: the total number of citations received by the publications of the publisher (= coci_open_cit + crossref_close_cit).

    20publishers_cr.csv: it is a CSV file that contains the numbers of the contributions to open citations made by the twenty publishers introduced in the previous CSV file as of 24 January 2018, according to the data available through the Crossref API. The counts listed in this file refer to the number of publications for which each publisher has submitted metadata to Crossref that include the publication’s reference list. The categories 'closed', 'limited' and 'open' refer to publications for which the reference lists are not visible to anyone outside the Crossref Cited-by membership, are visible only to them and to Crossref Metadata Plus members, or are visible to all, respectively. In addition, the file also records the total number of publications for which the publisher has submitted metadata to Crossref, whether or not those metadata include the reference lists of those publications.

    The columns of the CSV file are the following ones:

    publisher: the name of the publisher;

    open: the number of publications in Crossref with an 'open' visibility for their reference lists;

    limited: the number of publications in Crossref with a 'limited' visibility for their reference lists;

    closed: the number of publications in Crossref with a 'closed' visibility for their reference lists;

    overall_deposited: the overall number of publications for which the publisher has submitted metadata to Crossref.

  5. Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment

    • repository.soilwise-he.eu
    • data.niaid.nih.gov
    Updated Apr 2, 2022
    Cite
    (2022). Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment [Dataset]. http://doi.org/10.5281/zenodo.5348287
    Explore at:
    Dataset updated
    Apr 2, 2022
    Description

    Open Access

    This archive contains raw annual land cover maps, cropland abandonment maps, and accompanying derived data products to support: Crawford C.L., Yin, H., Radeloff, V.C., and Wilcove, D.S. 2022. Rural land abandonment is too ephemeral to provide major benefits for biodiversity and climate. Science Advances doi.org/10.1126/sciadv.abm8999. An archive of the analysis scripts developed for this project can be found at: https://github.com/chriscra/abandonment_trajectories (https://doi.org/10.5281/zenodo.6383127).

    Note that the label '_2022_02_07' in many file names refers to the date of the primary analysis. 'dts' or 'dt' refer to 'data.tables', large .csv files that were manipulated using the data.table package in R (Dowle and Srinivasan 2021, http://r-datatable.com/). 'Rasters' refer to '.tif' files that were processed using the raster and terra packages in R (Hijmans, 2022; https://rspatial.org/terra/; https://rspatial.org/raster).

    Data files fall into one of four categories of data derived during our analysis of abandonment: observed, potential, maximum, or recultivation. Derived datasets also follow the same naming convention, though they are aggregated across sites. These four categories are as follows (using 'age_dts' for our site in Shaanxi Province, China as an example):

    • observed abandonment identified through our primary analysis, with a threshold of five years. These files do not have a specific label beyond the description of the file and the date of analysis (e.g., shaanxi_age_2022_02_07.csv);

    • potential abandonment for a scenario without any recultivation, in which abandoned croplands are left abandoned from the year of initial abandonment through the end of the time series, with the label '_potential' (e.g., shaanxi_potential_age_2022_02_07.csv);

    • maximum age of abandonment over the course of the time series, with the label '_max' (e.g., shaanxi_max_age_2022_02_07.csv);

    • recultivation periods, corresponding to the lengths of recultivation periods following abandonment, given the label '_recult' (e.g., shaanxi_recult_age_2022_02_07.csv).

    This archive includes multiple .zip files, the contents of which are described below:

    • age_dts.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for, as of that year; also referred to as length, duration, etc.) for each year between 1987-2017 for all 11 sites. These maps are stored as .csv files, where each row is a pixel, the first two columns refer to the x and y coordinates (longitude and latitude), and subsequent columns contain the abandonment age values for an individual year (years are labeled with 'y' followed by the year, e.g., 'y1987'). Maps are given with a latitude and longitude coordinate reference system. The folder contains observed age, potential age ('_potential'), maximum age ('_max'), and recultivation lengths ('_recult') for all sites. Maximum age .csv files include only three columns: x, y, and the maximum length (i.e., 'max age', in years) for each pixel throughout the entire time series (1987-2017). Files were produced using the custom functions 'cc_filter_abn_dt()', 'cc_calc_max_age()', 'cc_calc_potential_age()', and 'cc_calc_recult_age()'; see '_util/_util_functions.R'.

    • age_rasters.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for) for each year between 1987-2017 for all 11 sites. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). The folder contains observed age, potential age ('_potential'), and maximum age ('_max') rasters for all sites. Maximum age rasters include just one band ('layer'). These rasters match the corresponding .csv files contained in 'age_dts.zip'.

    • derived_data.zip - Summary datasets created throughout this analysis, listed below.

    • diff.zip - .csv files for each of our eleven sites containing the year-to-year lagged differences in abandonment age (i.e., length of time abandoned) for each pixel. The rows correspond to a single pixel of land, and the columns refer to the year the difference is in reference to. These rows do not have longitude or latitude values associated with them; however, rows correspond to the same rows in the .csv files in 'input_data.tables.zip' and 'age_dts.zip'. These files were produced using the custom function 'cc_diff_dt()' (much like the base R function 'diff()'), contained within the custom function 'cc_filter_abn_dt()' (see '_util/_util_functions.R'). The folder contains diff files for observed abandonment, potential abandonment ('_potential'), and recultivation lengths ('_recult') for all sites.

    • input_dts.zip - Annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment (https://doi.org/10.1016/j.rse.2020.111873). Like 'age_dts', these maps are stored as .csv files, where each row is a pixel and the first two columns refer to x and y coordinates (longitude and latitude). Subsequent columns contain the land cover class for an individual year (e.g., 'y1987'). Note that these maps were recoded from Yin et al. 2020 so that the land cover classification was consistent across sites (see below). This contains two files for each site: the raw land cover maps from Yin et al. 2020 (after recoding), and a 'clean' version produced by applying 5- and 8-year temporal filters to the raw input (see the custom function 'cc_temporal_filter_lc()' in '_util/_util_functions.R' and '1_prep_r_to_dt.R'). These files correspond to those in 'input_rasters.zip', and serve as the primary inputs for the analysis.

    • input_rasters.zip - Annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Maps are given with a latitude and longitude coordinate reference system. Note that these maps were recoded so that land cover classes matched across sites (see below). Contains two files for each site: the raw land cover maps (after recoding), and a 'clean' version that has been processed with 5- and 8-year temporal filters (see above). These files match those in 'input_dts.zip'.

    • length.zip - .csv files containing the length (i.e., age or duration, in years) of each distinct individual period of abandonment at each site. This folder contains length files for observed and potential abandonment, as well as recultivation lengths. Produced using the custom functions 'cc_filter_abn_dt()' and 'cc_extract_length()'; see '_util/_util_functions.R'.
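    Given the layout described above for the age_dts files (x, y, then one 'yYYYY' column per year), a minimal sketch of how the abandonment maps could be summarized is shown below (the file name is one of the examples given above; everything else is an assumption):

```python
# Sketch: count abandoned pixels per year from one of the age_dts CSVs.
# A pixel is counted as abandoned in a given year if its abandonment age is >= 1.
import pandas as pd

df = pd.read_csv("shaanxi_age_2022_02_07.csv")

year_cols = [c for c in df.columns if c.startswith("y")]  # 'y1987' ... 'y2017'
abandoned_pixels = (df[year_cols] >= 1).sum()

# Pixel counts per year; converting to area would require the per-pixel area,
# which is not assumed here.
print(abandoned_pixels)
```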
    derived_data.zip contains the following files:

    • site_df.csv - A simple .csv containing descriptive information for each of our eleven sites, along with the original land cover codes used by Yin et al. 2020 (updated so that all eleven sites are consistent in how land cover classes were coded; see below).

    Primary derived datasets for both observed abandonment ('area_dat') and potential abandonment ('potential_area_dat'):

    • area_dat - Shows the area (in ha) in each land cover class at each site in each year (1987-2017), along with the area of cropland abandoned in each year following a five-year abandonment threshold (abandoned for >=5 years) or no threshold (abandoned for >=1 years). Produced using the custom function 'cc_calc_area_per_lc_abn()' via 'cc_summarize_abn_dts()'. See scripts 'cluster/2_analyze_abn.R' and '_util/_util_functions.R'.

    • persistence_dat - A .csv containing the area of cropland abandoned (ha) for a given 'cohort' of abandoned cropland (i.e., a group of cropland abandoned in the same year, also called 'year_abn') in a specific year. This area is also given as a proportion of the initial area abandoned in each cohort, i.e., the area of each cohort when it was first classified as abandoned at year 5 ('initial_area_abn'). The 'age' is given as the number of years since a given cohort of abandoned cropland was last actively cultivated, and 'time' is marked relative to the 5th year, when our five-year definition first classifies that land as abandoned (and where the proportion of abandoned land remaining abandoned is 1). Produced using the custom function 'cc_calc_persistence()' via 'cc_summarize_abn_dts()'. See scripts 'cluster/2_analyze_abn.R' and '_util/_util_functions.R'. This serves as the main input for our linear models of recultivation ('decay') trajectories.

    • turnover_dat - A .csv showing the annual gross gain, annual gross loss, and annual net change in the area (in ha) of abandoned cropland at each site in each year of the time series. Produced using the custom function 'cc_calc_abn_diff()' via 'cc_summarize_abn_dts()' (see '_util/_util_functions.R'), implemented in 'cluster/2_analyze_abn.R'. This file is only produced for observed abandonment.

    Area summary files (for observed abandonment only):

    • area_summary_df - Contains a range of summary values relating to the area of cropland abandonment for each of our eleven sites. All area values are given in hectares (ha) unless stated otherwise. It contains 16 variables as columns, including 1) 'site', 2) 'total_site_area_ha_2017' - the total site area (ha) in 2017, 3) 'cropland_area_1987' - the area in cropland in 1987 (ha), 4) 'area_abn_ha_2017' -

  6. UFC Records Dataset

    • kaggle.com
    zip
    Updated Mar 5, 2025
    Cite
    ustice (2025). UFC Records Dataset [Dataset]. https://www.kaggle.com/datasets/ustice/ufc-records-dataset/discussion
    Explore at:
    zip (5993 bytes)
    Dataset updated
    Mar 5, 2025
    Authors
    ustice
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset is up to date as of March 2025

    For this project I scraped data from the official site of the UFC using Python's BeautifulSoup package. The data I extracted was fighter career records. Once I had extracted the data I ran into some issues cleaning it, as some of the records held by fighters are time-based records, such as "Shortest Average Fight Time". The timecodes were originally of data type string, but I converted them to floats representing the time in minutes after writing a small conversion function. All the other numerical data were also strings, but those were easily converted to floats as well. Based on my analysis and statistics, one could definitely make a strong argument for Georges St-Pierre being the best UFC fighter of all time. However, it is never clear cut, and many other arguments could be made. He may have broken many records, but does that mean that in his prime he could beat any other fighter? Although he is on the board for the highest win streak and the most title-fight wins, he is not ranked number 1 in either of these categories. A lot of people believe it is Khabib Nurmagomedov, as he has an undefeated record of 29-0-0. However, his win-streak record is ranked number 6, with a streak of only 13, because he only fought 13 times in the UFC; his other wins were in the previous leagues he was a part of. In conclusion, given the data I have presented, many conclusions and arguments can be made for who is the true ultimate fighting champion. Look at the data and decide for yourself.
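    The conversion function itself is not included in the description; a minimal sketch of one way such "minutes:seconds" timecodes could be converted to float minutes is shown below (the helper name and example values are hypothetical):

```python
# Sketch: convert a "minutes:seconds" timecode string (e.g. "3:05") to float minutes.
def timecode_to_minutes(timecode: str) -> float:
    minutes, seconds = timecode.strip().split(":")
    return int(minutes) + int(seconds) / 60.0

# Hypothetical example values, not taken from the dataset.
print(timecode_to_minutes("3:05"))   # 3.083...
print(timecode_to_minutes("15:00"))  # 15.0
```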

  7. AI Earnings Call Transcript Database

    • kaggle.com
    Updated Jul 16, 2024
    Cite
    Robel Kidane (2024). AI Earnings Call Transcript Database [Dataset]. https://www.kaggle.com/datasets/robello/ai-earnings-call-transcript-database
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Robel Kidane
    Description

    Unlock invaluable insights into the AI strategies and discussions straight from the executive suites of leading companies.

    This comprehensive CSV file contains over 2000 rows of detailed commentary on AI mentions during earnings calls, meticulously organized to provide maximum value to investors, analysts, data scientists, business strategists, researchers, and technology consultants.

    For full access, submit email to https://docs.google.com/forms/d/e/1FAIpQLSdmZDOziISUENDDtNlJPuNfgPctUKsylD_vuUm0U4R9XBWOVw/viewform?usp=sf_link

  8. US B2B Contact Data | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON

    • datarade.ai
    .json, .csv
    Cite
    Forager.ai, US B2B Contact Data | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON [Dataset]. https://datarade.ai/data-products/us-b2b-contact-data-180m-records-bi-weekly-updates-csv-forager-ai
    Explore at:
    .json, .csv
    Dataset provided by
    Forager.ai
    Area covered
    United States of America
    Description

    US B2B Contact Database | 200M+ Verified Records | 95% Accuracy | API/CSV/JSON Elevate your sales and marketing efforts with America's most comprehensive B2B contact data, featuring over 200M+ verified records of decision-makers, from CEOs to managers, across all industries. Powered by AI and refreshed bi-weekly, this dataset ensures you have access to the freshest, most accurate contact details available for effective outreach and engagement.

    Key Features & Stats:

    200M+ Decision-Makers: Includes C-level executives, VPs, Directors, and Managers.

    95% Accuracy: Email & Phone numbers verified for maximum deliverability.

    Bi-Weekly Updates: Never waste time on outdated leads with our frequent data refreshes.

    50+ Data Points: Comprehensive firmographic, technographic, and contact details.

    Core Fields:

    Direct Work Emails & Personal Emails for effective outreach.

    Mobile Phone Numbers for cold calls and SMS campaigns.

    Full Name, Job Title, Seniority for better personalization.

    Company Insights: Size, Revenue, Funding data, Industry, and Tech Stack for a complete profile.

    Location: HQ and regional offices to target local, national, or international markets.

    Top Use Cases:

    Cold Email & Calling Campaigns: Target the right people with accurate contact data.

    CRM & Marketing Automation Enrichment: Enhance your CRM with enriched data for better lead management.

    ABM & Sales Intelligence: Target the right decision-makers and personalize your approach.

    Recruiting & Talent Mapping: Access CEO and senior leadership data for executive search.

    Instant Delivery Options:

    JSON – Bulk downloads via S3 for easy integration.

    REST API – Real-time integration for seamless workflow automation.

    CRM Sync – Direct integration with your CRM for streamlined lead management.

    Enterprise-Grade Quality:

    SOC 2 Compliant: Ensuring the highest standards of security and data privacy.

    GDPR/CCPA Ready: Fully compliant with global data protection regulations.

    Triple-Verification Process: Ensuring the accuracy and deliverability of every record.

    Suppression List Management: Eliminate irrelevant or non-opt-in contacts from your outreach.

    US Business Contacts | B2B Email Database | Sales Leads | CRM Enrichment | Verified Phone Numbers | ABM Data | CEO Contact Data | US B2B Leads | US prospects data

  9. Data from: KGCW 2023 Challenge @ ESWC 2023

    • investigacion.usc.gal
    Updated 2023
    + more versions
    Cite
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana; Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana (2023). KGCW 2023 Challenge @ ESWC 2023 [Dataset]. https://investigacion.usc.gal/documentos/67321d88aea56d4af0484859
    Explore at:
    Dataset updated
    2023
    Authors
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Şimşek, Umutcan; Iglesias, Ana
    Description

    Knowledge Graph Construction Workshop 2023: challenge

    Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics e.g. CPU or memory usage are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.

    Task description

    The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge, as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso, which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

    Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps for each parameter:

    The provided CSV files and SQL schema are loaded into a MySQL relational database.

    Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format.

    The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.

    The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

    The pipeline is executed 5 times, from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. Query timeout is set to 1 hour and knowledge graph construction timeout to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool; you can adapt the execution plans for this example pipeline to your own needs.

    Each parameter has its own directory in the ground truth dataset with the following files:

    Input dataset as CSV.

    Mapping file as RML.

    Queries as SPARQL.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    SPARQL queries to retrieve the results for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics (a measurement sketch follows this list):

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

    Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.
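    The challenge's own tool collects these metrics; purely as an illustration, a minimal sketch of how wall-clock time, CPU time, and peak memory of a single pipeline step (run as a subprocess on Linux) could be measured is shown below (the step command is a placeholder, and sampling the minimal memory consumption over time is not shown):

```python
# Sketch: measure execution time, CPU time, and peak memory of one pipeline step.
# Linux-only: resource.getrusage reports ru_maxrss in kilobytes there.
import resource
import subprocess
import time

step_cmd = ["./run_mapping_step.sh"]  # placeholder command for one pipeline step

wall_start = time.perf_counter()
usage_start = resource.getrusage(resource.RUSAGE_CHILDREN)

subprocess.run(step_cmd, check=True)

usage_end = resource.getrusage(resource.RUSAGE_CHILDREN)
wall_time = time.perf_counter() - wall_start
cpu_time = (usage_end.ru_utime + usage_end.ru_stime) - (usage_start.ru_utime + usage_start.ru_stime)
peak_mem_mb = usage_end.ru_maxrss / 1024  # peak RSS of child processes, in MB

print(f"execution time: {wall_time:.2f} s, CPU time: {cpu_time:.2f} s, peak memory: {peak_mem_mb:.1f} MB")
```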

    Expected output

    Duplicate values

    Scale          Number of Triples
    0 percent      2000000 triples
    25 percent     1500020 triples
    50 percent     1000020 triples
    75 percent     500020 triples
    100 percent    20 triples

    Empty values

    Scale          Number of Triples
    0 percent      2000000 triples
    25 percent     1500000 triples
    50 percent     1000000 triples
    75 percent     500000 triples
    100 percent    0 triples

    Mappings

    Scale          Number of Triples
    1TM + 15POM    1500000 triples
    3TM + 5POM     1500000 triples
    5TM + 3POM     1500000 triples
    15TM + 1POM    1500000 triples

    Properties

    Scale                 Number of Triples
    1M rows 1 column      1000000 triples
    1M rows 10 columns    10000000 triples
    1M rows 20 columns    20000000 triples
    1M rows 30 columns    30000000 triples

    Records

    Scale                 Number of Triples
    10K rows 20 columns   200000 triples
    100K rows 20 columns  2000000 triples
    1M rows 20 columns    20000000 triples
    10M rows 20 columns   200000000 triples

    Joins

    1-1 joins

    Scale          Number of Triples
    0 percent      0 triples
    25 percent     125000 triples
    50 percent     250000 triples
    75 percent     375000 triples
    100 percent    500000 triples

    1-N joins

    Scale               Number of Triples
    1-10 0 percent      0 triples
    1-10 25 percent     125000 triples
    1-10 50 percent     250000 triples
    1-10 75 percent     375000 triples
    1-10 100 percent    500000 triples
    1-5 50 percent      250000 triples
    1-10 50 percent     250000 triples
    1-15 50 percent     250005 triples
    1-20 50 percent     250000 triples

    N-1 joins

    Scale               Number of Triples
    10-1 0 percent      0 triples
    10-1 25 percent     125000 triples
    10-1 50 percent     250000 triples
    10-1 75 percent     375000 triples
    10-1 100 percent    500000 triples
    5-1 50 percent      250000 triples
    10-1 50 percent     250000 triples
    15-1 50 percent     250005 triples
    20-1 50 percent     250000 triples

    N-M joins

    Scale               Number of Triples
    5-5 50 percent      1374085 triples
    10-5 50 percent     1375185 triples
    5-10 50 percent     1375290 triples
    5-5 25 percent      718785 triples
    5-5 50 percent      1374085 triples
    5-5 75 percent      1968100 triples
    5-5 100 percent     2500000 triples
    5-10 25 percent     719310 triples
    5-10 50 percent     1375290 triples
    5-10 75 percent     1967660 triples
    5-10 100 percent    2500000 triples
    10-5 25 percent     719370 triples
    10-5 50 percent     1375185 triples
    10-5 75 percent     1968235 triples
    10-5 100 percent    2500000 triples

    GTFS Madrid Bench

    Generated Knowledge Graph

    Scale    Number of Triples
    1        395953 triples
    10       3959530 triples
    100      39595300 triples
    1000     395953000 triples

    Queries

    Query   Scale 1          Scale 10          Scale 100             Scale 1000
    Q1      58540 results    585400 results    No results available  No results available
    Q2      636 results      11998 results     125565 results        1261368 results
    Q3      421 results      4207 results      42067 results         420667 results
    Q4      13 results       130 results       1300 results          13000 results
    Q5      35 results       350 results       3500 results          35000 results
    Q6      1 result         1 result          1 result              1 result
    Q7      68 results       67 results        67 results            53 results
    Q8      35460 results    354600 results    No results available  No results available
    Q9      130 results      1300 results      13000 results         130000 results
    Q10     1 result         1 result          1 result              1 result
    Q11     130 results      260 results       260 results           260 results
    Q12     13 results       130 results       1300 results          13000 results
    Q13     265 results      2650 results      26500 results         265000 results
    Q14     2234 results     22340 results     223400 results        No results available
    Q15     592 results      8684 results      35502 results         206628 results
    Q16     390 results      780 results       260 results           780 results
    Q17     855 results      8550 results      85500 results         855000 results
    Q18     104 results      1300 results      13000 results         130000 results

  10. Hourly electricity load profiles of paper producing and food processing industries

    • narcis.nl
    • data.mendeley.com
    Updated Mar 19, 2021
    Cite
    Valdes, J (via Mendeley Data) (2021). Hourly electricity load profiles of paper producing and food processing industries [Dataset]. http://doi.org/10.17632/ttx9chkdcg.1
    Explore at:
    Dataset updated
    Mar 19, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Valdes, J (via Mendeley Data)
    Description

    The data provided are synthetic hourly electricity load profiles for the paper and food industries for one year. The data have been synthesized from two years of measured data from industries in Chile using a comprehensive clustering analysis. The synthetic data possess the same statistical characteristics as the measured data but are provided normalized to one kW and anonymized so that they can be used without confidentiality issues. Three CSV files are provided: food_i.csv, paper_i_small.csv and paper_i_large.csv, containing the data of a small food processing industry, a small paper industry, and a medium-large paper industry, respectively. All three files contain seven columns of data: weekday, month, hour, cluster, min, max, mean. The first four columns index the data in the following way:

    • Month: integer values between 1 and 12 for the consecutive calendar months of a year, starting in January (1) and ending in December (12).
    • Weekday: integer values in the range 1 to 7 for the consecutive days of the week, starting on Monday (1) and ending on Sunday (7).
    • Hour: integer values ranging between 1 and 24, which describe the hours of the day.
    • Cluster: the cluster to which this data point is associated. The number of clusters is different for each load profile, as is the number of days included in each cluster. Since the clusters were calculated for days, a cluster number covers 24 consecutive points of data.

    The load profile data are provided in the three different columns: min, max and mean:

    Min: this column provides the min value of the cluster at that time of the day. Therefore, it represents the minimum demand of electricity recorded in all the days belonging to this representative group of data.

    Max: This column provides the maximum electric load of the cluster at that time of the day. It represents the maximum demand of electricity in all the days belonging to this representative group of data at that hour of the day.

    Mean: This column provides the average electric load of the cluster at that time of the day. It represents the mean demand for electricity belonging to this representative group of data at that hour of the day.

    The min, max and mean values are different for each hour of the day. All values are normalized to the range 0 to 1, with the unit kW.
    For details on the clustering procedure or the data itself, please refer to the associated paper published in the journal Energy and the one published in the Data in Brief journal.
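    As an illustration of how these index columns could be used, the sketch below reconstructs the mean hourly load profile of a single day type from food_i.csv (the column names are those documented above; the specific month and weekday chosen are arbitrary):

```python
# Sketch: extract the mean hourly load profile for Mondays in March from food_i.csv.
import pandas as pd

df = pd.read_csv("food_i.csv")

day = df[(df["month"] == 3) & (df["weekday"] == 1)]

# Average over all clusters that contain Mondays in March, hour by hour.
profile = day.groupby("hour")["mean"].mean().sort_index()
print(profile)  # 24 normalized values (0-1, in kW), one per hour
```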

    The study was supported by the German Federal Ministry of Education and Research - BMBF and the Chilean National Commission for Scientific Research and Technology - CONICYT (grant number BMBF150075) , the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH through the Energy Program in Chile, and the European Research Council (“reFUEL” ERC-2017-STG 758149).

  11. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
    Explore at:
    pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG Matrix

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    Getting Started

    This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1], the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive is presented with two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file 'Word-Category RIG Matrix.csv' contains a total of 254 columns.

    This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories.

    LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of the LScDC by the sum of their RIGs in categories. That is, words are arranged by their informativeness in the scientific corpus LSC. Therefore, the meaningfulness of words is evaluated by the words' average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus.

    Words as a Vector of Frequencies in WoS Categories

    Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus. The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category.
    Words as a Vector of Relative Information Gains Extracted for Categories

    In this section, we introduce our approach to the representation of a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by its information gained for categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.

    The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain. This provides the ability to compare information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.

    Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of the vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category.

    As well as ordering words in each category, words can be ordered by two criteria: the sum and the maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix. RIGs for each word of the LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

    Leicester Scientific Thesaurus (LScT)

    The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, the meaningfulness of words is evaluated by the words' average informativeness in the categories, and the list of these words is considered as a 'thesaurus' for science. The LScT with the value of the sum can be found as a CSV file in the published archive.
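    A minimal sketch of how such a ranking could be reproduced from the matrix (assuming the first column holds the words and the last two columns are the sum and maximum of RIGs, as described above; the exact column labels in the CSV are not assumed):

```python
# Sketch: rank LScDC words by the sum of their RIGs over the 252 WoS categories
# and take the top 5,000 as an LScT-style word list.
import pandas as pd

# Assumes the first column of the CSV holds the LScDC words.
rig = pd.read_csv("Word_Category_RIG_Matrix.csv", index_col=0)

sum_col = rig.columns[-2]  # second-to-last column: sum (S) of RIGs, per the description
top_words = rig[sum_col].sort_values(ascending=False).head(5000)
print(top_words.head(10))
```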
    The published archive contains the following files:

    1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
    2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of the LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
    3) LScT.csv: List of words of LScT with sum (S) values.
    4) Text_No_in_Cat.csv: The number of texts in categories.
    5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
    6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and LScT, and the procedures used to form them.
    7) README.pdf: Same as 6, in PDF format.

    References

    [1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    [2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
    [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.

  12. Milan AirQuality and Weather Dataset(daily&hourly)

    • kaggle.com
    zip
    Updated Mar 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eduardo Mosca (2025). Milan AirQuality and Weather Dataset(daily&hourly) [Dataset]. https://www.kaggle.com/datasets/edmos07/milan-air-quality-and-weather-dataset-daily
    Explore at:
    zip(5079748 bytes)Available download formats
    Dataset updated
    Mar 21, 2025
    Authors
    Eduardo Mosca
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [WARNING: an in-depth description for the hourly data is missing at the moment. Please refer to the Open-Meteo website (Air Quality and Historical Weather APIs specifically) for descriptions of the columns included in the hourly data for the time being. In short, though, the hourly data info can be derived from the daily data info, as the hourly data is used to construct the daily data. Example: if avg_nitrogen_dioxide is the average of the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day, the "nitrogen_dioxide" column will consist of the hourly instant measurements of nitrogen dioxide (10 meters above ground, in μg/m3).]

    Result of a course project in the context of the Master's Degree in Data Science at Università Degli Studi di Milano-Bicocca. The dataset was built in the hope of finding ways to tackle the poor air quality for which Milan is becoming renowned, and to make the training of ML models possible. The data was collected through Open-Meteo's APIs, which in turn source it from European "reanalysis models" used for weather and air quality forecasting. The data was validated by the owners of the reanalysis datasets from which it comes, and during the construction of this specific dataset its data quality was assessed across the accuracy, completeness and consistency dimensions. We aggregated the data from hourly to daily; the entire data management process can be consulted in the attached PDF.

    File descriptions:

    • weatheraqDataset.csv: contains DAILY data on weather and air quality for the city of Milan in comma-separated values (CSV) format.
    • weatheraqDataset_Report.pdf: report built to illustrate and make explicit the process followed to build the final dataset starting from the original data sources; it also explains any processing and aggregation/integration operations carried out.
    • weatheraqHourly.csv: HOURLY data, the counterpart to the daily dataset (the daily data is the result of aggregating the hourly data). The higher granularity and number of rows can help achieve better results; for detailed descriptions of how these hourly values are recorded and at what resolutions, please visit the Open-Meteo website as stated in the warning at the start of the description.

    GitHub repo of the project: https://github.com/edmos7/weather-aqMilan
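    As a rough illustration of the hourly-to-daily aggregation described in this card, here is a minimal sketch; the column names are assumed from the descriptions below, and the authoritative procedure is documented in weatheraqDataset_Report.pdf.

```python
# Minimal sketch: derive a few daily nitrogen dioxide fields from the hourly file.
# Column names ("datetime", "nitrogen_dioxide") are assumptions based on this card.
import pandas as pd

hourly = pd.read_csv("weatheraqHourly.csv", parse_dates=["datetime"])
hourly["date"] = hourly["datetime"].dt.date

grouped = hourly.groupby("date")["nitrogen_dioxide"]
daily = pd.DataFrame({
    "avg_nitrogen_dioxide": grouped.mean(),
    "max_nitrogen_dioxide": grouped.max(),
    "min_nitrogen_dioxide": grouped.min(),
    # Hour at which the hourly values reached their daily maximum.
    "max_time_nitrogen_dioxide": hourly.loc[grouped.idxmax(), "datetime"].dt.time.values,
})
print(daily.head())
```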

    Column descriptions for DAILY data (weatheraqDataset.csv):

    note: both 'date' in the DAILY data and 'datetime' in the HOURLY data are in local Milan time (CET/CEST), adjusted for daylight saving time (DST).

    • date: the calendar day to which the other values are relative. YYYY-MM-DD format, in Milan local time.
    • avg_nitrogen_dioxide: the average of the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day.
    • max_nitrogen_dioxide: the maximum value among the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day.
    • min_nitrogen_dioxide: the minimum value among the hourly instant (10 meters above ground, in μg/m3) nitrogen dioxide values for a particular day.
    • max_time_nitrogen_dioxide: hour at which the hourly nitrogen dioxide values reached their maximum, HH:MM:SS.
    • min_time_nitrogen_dioxide: hour at which the hourly nitrogen dioxide values reached their minimum, HH:MM:SS. NOTE: all other "pollutant" columns (pm10, pm2_5, sulphur_dioxide, ozone) follow the same structure as the above unless specified below.
    • pm2_5_avgdRolls: the average of the 24-hour rolling averages for particulate matter with diameter below 2.5 μm (pm2.5), in a particular day. Rolling averages are used to compute the European Air Quality Index (EAQI) at a given moment, so in the computation of our daily EAQI, averages of rolling averages were used. NOTE: the above also applies to the 'pm10_avgdRolls' field.
    • eaqi : the computed air quality level according to European Environment Agency thresholds, considering daily averages for ozone, sulphur dioxide and nitrogen dioxide, and average of daily rolling averages for pm10 and pm2.5. The value corresponds to the highest level among single pollutant levels.
    • nitrogen_dioxide_eaqi : the air quality level computed through EAQI thresholds for nitrogen dioxide individually, all other [pollutant]_eaqi fields follow same reasoning.
    • avg_temperature_2m: average of hourly air temperatures recorded at 2 meters above ground level for the day (°C);
    • max_temperature_2m: maximum among hourly air temperatures recorded at 2 meters above ground level for the day (°C);
    • min_temperature_2m: minimum among hourly air temperatures recorded at 2 meters above ground level for the day (°C);
    • avg_relative_humidity_2m: average of hourly relative humidity recorded at 2 meters above ground level for the day (%);
    • avg_dew_point_2m: average of hourly dew point temperatures recorded at 2 meters above ground for the day (°C);
    • avg_apparent_temperature: average of hourly...
  13. KGCW 2024 Challenge @ ESWC 2024

    • investigacion.usc.gal
    • investigacion.usc.es
    • +3more
    Updated 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana; Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://investigacion.usc.gal/documentos/668fc40fb9e7c03b01bd383a?lang=de
    Explore at:
    Dataset updated
    2024
    Authors
    Van Assche, Dylan; Chaves-Fraga, David; Dimou, Anastasia; Serles, Umutcan; Iglesias, Ana
    Description

    Knowledge Graph Construction Workshop 2024: challenge

    Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. Besides execution time as a metric for comparing knowledge graph construction, other metrics e.g. CPU or memory usage are not considered. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics e.g. execution time, CPU, memory usage, or a combination of these metrics.

    Task description

    The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state-of-the-art of the existing tools and the baseline results provided by this challenge. This challenge is not limited to execution times to create the fastest pipeline, but also computing resources to achieve the most efficient pipeline.

    We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics such as execution time, CPU and memory usage, necessary for this challenge, as CSV files. Moreover, the information about the hardware used during the execution of the pipeline is available as well to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool is already tested with existing systems, relational databases e.g. MySQL and PostgreSQL, and triplestores e.g. Apache Jena Fuseki and OpenLink Virtuoso, which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

    Track 1: Conformance

    The set of new specifications for the RDF Mapping Language (RML), established by the W3C Community Group on Knowledge Graph Construction, provides a set of test-cases for each module:

    RML-Core

    RML-IO

    RML-CC

    RML-FNML

    RML-Star

    These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementation yet, thus it may contain bugs and issues. If you find problems with the mappings, output, etc. please report them to the corresponding repository of each module.

    Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

    Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

    Track 2: Performance

    Part 1: Knowledge Graph Construction Parameters

    These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

    Number of and type of joins: scaling the number of joins and type of joins (1-1, N-1, 1-N, N-M)

    Part 2: GTFS-Madrid-Bench

    The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

    The ground truth dataset and baseline results are generated in different steps for each parameter:

    The provided CSV files and SQL schema are loaded into a MySQL relational database.

    Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in N-Triples as RDF format

    The pipeline is executed 5 times, from which the median execution time of each step is calculated and reported. Each step with the median execution time is then reported in the baseline results with all its measured metrics. The knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool; you can adapt the execution plans for this example pipeline to your own needs.

    Each parameter has its own directory in the ground truth dataset with the following files:

    Input dataset as CSV.

    Mapping file as RML.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

    All input datasets are provided as CSV; depending on the parameter that is being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

    Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

    CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

    Submissions must evaluate the following metrics:

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

    Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.
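    Purely as an illustration of these three metrics (the challenge's own tool at https://github.com/kg-construct/challenge-tool already collects and aggregates them, so this is not the official implementation), measuring a single locally executed step could look roughly like this, using the third-party psutil package:

```python
# Illustrative sketch only: sample one step's wall-clock time, CPU time,
# and min/max resident memory while the step runs as a subprocess.
import subprocess
import time

import psutil  # third-party: pip install psutil


def run_step(cmd: list[str], poll_interval: float = 0.1) -> dict:
    """Run one pipeline step and report execution time, CPU time, min/max memory."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    min_rss, max_rss, cpu_time = None, 0, 0.0
    while proc.poll() is None:  # sample the process until the step finishes
        try:
            rss = ps.memory_info().rss
            min_rss = rss if min_rss is None else min(min_rss, rss)
            max_rss = max(max_rss, rss)
            times = ps.cpu_times()
            cpu_time = times.user + times.system
        except psutil.NoSuchProcess:
            break
        time.sleep(poll_interval)
    return {
        "execution_time_s": time.perf_counter() - start,
        "cpu_time_s": cpu_time,
        "min_memory_bytes": min_rss,
        "max_memory_bytes": max_rss,
    }


if __name__ == "__main__":
    print(run_step(["sleep", "1"]))
```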

    Expected output

    Duplicate values

    Scale Number of Triples

    0 percent 2000000 triples

    25 percent 1500020 triples

    50 percent 1000020 triples

    75 percent 500020 triples

    100 percent 20 triples

    Empty values

    Scale Number of Triples

    0 percent 2000000 triples

    25 percent 1500000 triples

    50 percent 1000000 triples

    75 percent 500000 triples

    100 percent 0 triples

    Mappings

    Scale Number of Triples

    1TM + 15POM 1500000 triples

    3TM + 5POM 1500000 triples

    5TM + 3POM 1500000 triples

    15TM + 1POM 1500000 triples

    Properties

    Scale Number of Triples

    1M rows 1 column 1000000 triples

    1M rows 10 columns 10000000 triples

    1M rows 20 columns 20000000 triples

    1M rows 30 columns 30000000 triples

    Records

    Scale Number of Triples

    10K rows 20 columns 200000 triples

    100K rows 20 columns 2000000 triples

    1M rows 20 columns 20000000 triples

    10M rows 20 columns 200000000 triples

    Joins

    1-1 joins

    Scale Number of Triples

    0 percent 0 triples

    25 percent 125000 triples

    50 percent 250000 triples

    75 percent 375000 triples

    100 percent 500000 triples

    1-N joins

    Scale Number of Triples

    1-10 0 percent 0 triples

    1-10 25 percent 125000 triples

    1-10 50 percent 250000 triples

    1-10 75 percent 375000 triples

    1-10 100 percent 500000 triples

    1-5 50 percent 250000 triples

    1-10 50 percent 250000 triples

    1-15 50 percent 250005 triples

    1-20 50 percent 250000 triples

    N-1 joins

    Scale Number of Triples

    10-1 0 percent 0 triples

    10-1 25 percent 125000 triples

    10-1 50 percent 250000 triples

    10-1 75 percent 375000 triples

    10-1 100 percent 500000 triples

    5-1 50 percent 250000 triples

    10-1 50 percent 250000 triples

    15-1 50 percent 250005 triples

    20-1 50 percent 250000 triples

    N-M joins

    Scale Number of Triples

    5-5 50 percent 1374085 triples

    10-5 50 percent 1375185 triples

    5-10 50 percent 1375290 triples

    5-5 25 percent 718785 triples

    5-5 50 percent 1374085 triples

    5-5 75 percent 1968100 triples

    5-5 100 percent 2500000 triples

    5-10 25 percent 719310 triples

    5-10 50 percent 1375290 triples

    5-10 75 percent 1967660 triples

    5-10 100 percent 2500000 triples

    10-5 25 percent 719370 triples

    10-5 50 percent 1375185 triples

    10-5 75 percent 1968235 triples

    10-5 100 percent 2500000 triples

    GTFS Madrid Bench

    Generated Knowledge Graph

    Scale Number of Triples

    1 395953 triples

    10 3959530 triples

    100 39595300 triples

    1000 395953000 triples

    Queries

    Query Scale 1 Scale 10 Scale 100 Scale 1000

    Q1 58540 results 585400 results No results available No results available

    Q2 636 results 11998 results 125565 results 1261368 results

    Q3 421 results 4207 results 42067 results 420667 results

    Q4 13 results 130 results 1300 results 13000 results

    Q5 35 results 350 results 3500 results 35000 results

    Q6 1 result 1 result 1 result 1 result

    Q7 68 results 67 results 67 results 53 results

    Q8 35460 results 354600 results No results available No results available

    Q9 130 results 1300

  14. Geochemical data supporting analysis of geochemical conditions and nitrogen...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Nov 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). Geochemical data supporting analysis of geochemical conditions and nitrogen transport in nearshore groundwater and the subterranean estuary at a Cape Cod embayment, East Falmouth, Massachusetts, 2013 [Dataset]. https://catalog.data.gov/dataset/geochemical-data-supporting-analysis-of-geochemical-conditions-and-nitrogen-transport-in-n
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Cape Cod, East Falmouth, Falmouth, Massachusetts
    Description

    This data release provides analytical and other data in support of an analysis of nitrogen transport and transformation in groundwater and in a subterranean estuary in the Eel River and onshore locations on the Seacoast Shores peninsula, Falmouth, Massachusetts. The analysis is described in U.S. Geological Survey Scientific Investigations Report 2018-5095 by Colman and others (2018). This data release is structured as a set of comma-separated values (CSV) files, each of which contains data columns for laboratory (if applicable), USGS Site Name, date sampled, time sampled, and columns of specific analytical and(or) other data. The .csv data files have the same number of rows and each row in each .csv file corresponds to the same sample. Blank cells in a .csv file indicate that the sample was not analyzed for that constituent. The data release also provides a Data Dictionary (Data_Dictionary.csv) that provides the following information for each constituent (analyte): laboratory or data source, data type, description of units, method, minimum reporting limit, limit of quantitation if appropriate, method reference citations, minimum, maximum, median, and average values for each analyte. The data release also contains a file called Abbreviations in Data_Dictionary.pdf that contains all of the abbreviations in the Data Dictionary and in the well characteristics file in the companion report, Colman and others (2018).
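    Because every CSV file has the same number of rows and row i in each file refers to the same sample, the files can be read and placed side by side; a minimal sketch (the file-handling details are assumptions, not part of the data release) might look like:

```python
# Minimal sketch: combine the aligned CSV files of the data release.
# Blank cells load as NaN, meaning the sample was not analyzed for that constituent.
import glob

import pandas as pd

paths = [p for p in sorted(glob.glob("*.csv")) if "Data_Dictionary" not in p]
tables = [pd.read_csv(p) for p in paths]

# Rows are aligned across files (row i is the same sample in every file),
# so the tables can simply be concatenated column-wise.
combined = pd.concat(tables, axis=1)

# Constituents with the most missing (not analyzed) values:
print(combined.isna().sum().sort_values(ascending=False).head())
```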

  15. Rainfall, Volumetric Soil-Water Content, Video, and Geophone Data from the...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 13, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). Rainfall, Volumetric Soil-Water Content, Video, and Geophone Data from the Hermits Peak-Calf Canyon Fire Burn Area, New Mexico, June 2022 to June 2024 [Dataset]. https://catalog.data.gov/dataset/rainfall-volumetric-soil-water-content-video-and-geophone-data-from-the-hermits-peak-calf-
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Calf Canyon, Hermit Peak
    Description

    Precipitation, volumetric soil-water content, videos, and geophone data characterizing postfire debris flows were collected at the 2022 Hermit’s Peak Calf-Canyon Fire in New Mexico. This dataset contains data from June 22, 2022, to June 26, 2024. The data were obtained from a station located at 35° 42’ 28.86” N, 105° 27’ 18.03” W (geographic coordinate system). Each data type is described below.

    Raw Rainfall Data: Rainfall data, Rainfall.csv, are contained in a comma-separated value (.csv) file. The data are continuous and sampled at 1-minute intervals. The columns in the csv file are TIMESTAMP (UTC), RainSlowInt (the depth of rain in each minute [mm]), CumRain (cumulative rainfall since the beginning of the record [mm]), and VWC# (volumetric water content [V/V]) at three depths (1 = 10 cm, 2 = 30 cm, and 3 = 50 cm). VWC values outside of the range of 0 to 0.5 represent sensor malfunctions and were replaced with -99999.

    Storm Record: We summarized the rainfall, volumetric soil-water content, and geophone data based on rainstorms. We defined a storm as rain for a duration >= 5 minutes or with an accumulation > 2.54 mm. Each storm was then assigned a storm ID starting at 0. The storm record data, StormRecord.csv, provides peak rainfall intensities and times and volumetric soil-water content information for each storm. The columns from left to right provide the information as follows: ID, StormStart [yyyy-mm-dd hh:mm:ss-tz], StormStop [yyyy-mm-dd hh:mm:ss-tz], StormDepth [mm], StormDuration [h], I-5 [mm h-1], I-10 [mm h-1], I-15 [mm h-1], I-30 [mm h-1], I-60 [mm h-1], I-5 time [yyyy-mm-dd hh:mm:ss-tz], I-10 time [yyyy-mm-dd hh:mm:ss-tz], I-15 time [yyyy-mm-dd hh:mm:ss-tz] ([UTC], the time of the peak 15-minute rainfall intensity), I-30 time [yyyy-mm-dd hh:mm:ss-tz] ([UTC], the time of the peak 30-minute rainfall intensity), I-60 time [yyyy-mm-dd hh:mm:ss-tz] ([UTC], the time of the peak 60-minute rainfall intensity), VWC (volumetric water content [V/V] at three depths (1 = 10 cm, 2 = 30 cm, 3 = 50 cm) at the start of the storm, the time of the peak 15-minute rainfall intensity, and the end of the storm), Velocity [m s-1] of the flow, and Event (qualitative observation of the type of flow from video footage). VWC values outside of the range of 0 to 0.5 represent sensor malfunctions and were replaced with -99999. Velocity was only calculated for flows with a noticeable surge, as the rest of the signal is not sufficient for a cross-correlation, and Event was only filled for storms with quality video data. Values of -99999 were assigned for these columns for all other storms.

    Geophone Data: Geophone data, GeophoneData.zip, are contained in comma-separated value (.csv) files labeled by ‘storm’ and the corresponding storm ID in the storm record, and labeled IDa and IDb if the geophone stopped recording for more than an hour during the storm. The data were recorded at two geophones sampled at 50 Hz, one 11.5 m upstream from the station and one 9.75 m downstream from the station. Geophones were triggered to record when 1.6 mm of rain was detected during a period of 10 minutes, and they continued to record for 30 minutes past the last timestamp when this criterion was met. The columns in each csv file are TIMESTAMP [UTC], GeophoneUp_mV (the upstream geophone [mV]), and GeophoneDn_mV (the downstream geophone [mV]). Note that there are occasional missed samples, when the data logger did not record due to geophone malfunction, where data points are 0.04 s or more apart.
Videos: The videos stormID_mmdd.mp4 (or .mov) are organized by storm ID where one folder contains data for one storm. Within folders for each storm, videos are labeled by the timestamp in UTC of the end of the video as IMGPhhmm. Some videos in the early mornings or late evenings, or in very intense rainfall, have had brightness and contrast adjustments in Adobe Premiere Pro for better video quality and are in MP4 format. All raw videos are in MOV format. The camera triggered when a minimum of 1.6 mm of rain fell in a 10-minute interval and it recorded in 16-minute video clips until it was 30 minutes since the last trigger. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
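    A minimal sketch of working with Rainfall.csv under the storm definition above (column names are assumed from the description; the grouping of consecutive wet minutes is a simplification, and the published StormRecord.csv remains the authoritative product):

```python
# Minimal sketch: load the 1-minute rainfall record, handle the -99999
# sentinel, and group wet minutes into candidate storm events.
import pandas as pd

timestamp_col = "TIMESTAMP"  # adjust to the actual header in Rainfall.csv
rain = pd.read_csv("Rainfall.csv", parse_dates=[timestamp_col])
rain = rain.replace(-99999, float("nan"))  # sensor-malfunction sentinel

# Flag wet minutes and number runs of consecutive wet minutes.
wet = rain["RainSlowInt"] > 0
event_id = (wet & ~wet.shift(fill_value=False)).cumsum().where(wet)

events = (
    rain.assign(event=event_id)
        .dropna(subset=["event"])
        .groupby("event")
        .agg(start=(timestamp_col, "min"),
             stop=(timestamp_col, "max"),
             depth_mm=("RainSlowInt", "sum"),
             duration_min=("RainSlowInt", "size"))
)

# Keep events that satisfy the stated storm definition.
storms = events[(events["duration_min"] >= 5) | (events["depth_mm"] > 2.54)]
print(storms.head())
```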

  16. Question-Answer combination

    • kaggle.com
    zip
    Updated Jan 9, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ailurophile (2020). Question-Answer combination [Dataset]. https://www.kaggle.com/veeralakrishna/questionanswer-combination
    Explore at:
    zip(3231267 bytes)Available download formats
    Dataset updated
    Jan 9, 2020
    Authors
    Ailurophile
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Valuelabs ML Hackathon

    About Problem Statement: Genre: NLP - Problem Type: Contextual Semantic Similarity, Auto-generate Text-based answers

    Submission Format: - You need to generate up to 3 distractors for each Question-Answer combination - Each distractor is a string - The 3 distractors/strings need to be separated with a comma (,) - Each value in Results.csv's distractor column will contain the distractors as follows: distractor_for_QnA_1 = "distractor1","distractor2","distractor3"

    About the Evaluation Parameter: - All distractor values for 1 question-answer will be converted into a vector form - 1 vector gets generated for submitted distractors and 1 vector is generated for truth value - cosine_similarity between these 2 vectors is evaluated - Similarly, cosine_similarity gets evaluated for all the question-answer combinations - Score of your submitted prediction file = mean ( cosine_similarity between distractor vectors for each entry in test.csv)
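    To make the scoring concrete, here is a minimal sketch of the cosine-similarity idea for a single question-answer entry. The organisers' exact vectorisation is not specified in this description, so a TF-IDF representation is assumed purely for illustration.

```python
# Minimal sketch: cosine similarity between predicted and ground-truth
# distractors for one entry; the leaderboard score is the mean over test.csv.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def distractor_similarity(predicted: list[str], truth: list[str]) -> float:
    """Cosine similarity between the joined predicted and ground-truth distractors."""
    docs = [" ".join(predicted), " ".join(truth)]
    vectors = TfidfVectorizer().fit_transform(docs)
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


score = distractor_similarity(
    ["distractor1", "distractor2", "distractor3"],
    ["true1", "true2", "true3"],
)
print(score)
```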

    Common Issues Faced: How to handle them?:

    Download Dataset giving XML error: Try restarting your session after clearing browser cache/cookies and try again. If you still face any issue, please raise a ticket with us.

    Upload Prediction File not working: Ensure you are compliant with the Guidelines and FAQs. You will face this error if you exceed the maximum number of prediction file uploads allowed.

    Exceptions (Incorrect number of Rows / Incorrect Headers / Prediction missing for a key): For this problem statement, we recommend you update the 'distractor' column in Results.csv with your predictions, following the format explained above.

    Evaluation is getting stuck in a loop: We recommend you immediately refresh your session and start afresh with a cleared cache. Please ensure your predictions.csv matches the file format of Results.csv. Please check that all the above-mentioned checks have been conducted. If you still face any issue, please raise a ticket with us.

  17. Synthetic Order Records: 10K to 10M Records

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Swain (2025). Synthetic Order Records: 10K to 10M Records [Dataset]. https://www.kaggle.com/datasets/swainproject/synthetic-order-records-10k-to-10m-records
    Explore at:
    zip(322954679 bytes)Available download formats
    Dataset updated
    Nov 8, 2025
    Authors
    Swain
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    About Dataset

    This dataset contains 7 pre-generated CSV files with realistic synthetic order records, ranging from 10,000 to 10,000,000 records. Perfect for e-commerce development, inventory testing, data analysis, and prototyping workflows without privacy concerns.

    What's Included

    The Order dataset includes the following fields:

    Each field is listed as: field name, description (type); example values; range.

    • order_id: unique sequential order identifier (Integer); example: 1, 2, 3, ...; range: 1 to N
    • person_id: foreign key to person/customer (Integer); example: 1, 100, 5000; range: 1 to N*
    • order_date: order creation date (String); example: 2021-03-15, 2025-11-28; YYYY-MM-DD format
    • status: current order status (String); example: Pending, Shipped, Delivered; 7 status values
    • total_amount: order total value (Float); example: 24.99, 599.95, 1499.99; range: $10.00 - $1500.00
    • currency: currency code (String); example: USD, EUR, GBP; 3 ISO codes
    • payment_method: payment type used (String); example: credit_card, paypal, klarna; 7 payment methods
    • shipping_method: delivery option selected (String); example: standard, express, fedex; 8 shipping methods
    • notes: additional order notes (String); example: Please leave at door; ~5% populated
    Please note: person_id is limited to the maximum value of order_id. So for the file with 10,000 order rows, person_id is a random value between 1 and 10,000.
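    A minimal sketch for loading one of the files and checking the person_id note above (the file name is hypothetical; use the actual name of the downloaded CSV):

```python
# Minimal sketch: load a file and sanity-check the schema described above.
import pandas as pd

orders = pd.read_csv("orders_10k.csv", parse_dates=["order_date"])  # hypothetical file name
assert orders["order_id"].is_unique
assert orders["person_id"].between(1, orders["order_id"].max()).all()
print(orders["status"].value_counts())
```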

    File Sizes

    • 10K records
    • 100K records
    • 500K records
    • 1M records
    • 2M records
    • 5M records
    • 10M records

    Why This Dataset?

    ✓ No privacy concerns—completely synthetic data
    ✓ Perfect for e-commerce and inventory system testing
    ✓ Ideal for ML model training and prototyping
    ✓ Ready-to-use CSV format
    ✓ Multiple sizes for different use cases
    ✓ Realistic pricing, ratings, and product specifications

    Use Cases

    • E-commerce platform development and testing
    • Inventory management system validation
    • Data warehouse and ETL pipeline testing
    • Database performance benchmarking and load testing
    • API integration testing with realistic product data
    • Machine learning and predictive analytics prototyping
    • Data pipeline validation

    License: CC BY 4.0 (Please attribute to Swain / SwainLabs when sharing)

  18. Largest Flipkart Product Listings

    • crawlfeeds.com
    csv, zip
    Updated Mar 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Largest Flipkart Product Listings [Dataset]. https://crawlfeeds.com/datasets/flipkart-products-dataset
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    Looking to enhance your data-driven projects? Our large Flipkart e-commerce dataset is now available in CSV format, offering a wealth of information across multiple product categories. Whether you need the complete dataset or a specific subset based on categories, we've got you covered.

    What’s Included in the Flipkart Dataset?

    Our dataset is meticulously curated to provide high-quality, reliable data for your e-commerce and AI projects. It includes detailed product information spanning various categories, such as:

    • Automotive Accessories
    • Baby Care Products
    • Mobiles & Accessories
    • Men's Fashion Dataset from Flipkart
    • Home Improvement Items
    • Beauty and Grooming Products
    • Footwear
    • Jewellery
    • Toys and Games
    • Health Care Supplies
    • Kitchen, Cookware & Serveware
    • Computers and Accessories
    • Audio & Video Equipment
      …and many more!

    Sample CSV File for Preview

    A sample CSV file with 200 records is available for download after a quick signup. Use this sample to evaluate the data structure, quality, and relevance to your project requirements.

    Why Choose Our Flipkart Product Dataset?

    • Customizable Subsets: Request a subset of data tailored to specific categories that suit your project needs.
    • Versatile Applications: Perfect for building recommendation engines, price comparison tools, inventory management systems, and market trend analysis.
    • Ease of Access: The dataset is available in CSV format for seamless integration into your workflows.
    • Diverse Categories: Covering everything from fashion and home decor to electronics and festive decor, this dataset offers unmatched variety.

    How to Get the Flipkart Dataset?

    Visit Crawl Feeds Data Request to request access to the complete dataset or a customized subset.

  19. Data from: Cold winters drive consistent and spatially synchronous 8-year...

    • demo.researchdata.se
    • researchdata.se
    Updated Dec 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sara Emery; Ola Lundin (2022). Data from: Cold winters drive consistent and spatially synchronous 8-year population cycles of cabbage stem flea beetle [Dataset]. https://demo.researchdata.se/en/catalogue/dataset/2022-215-1
    Explore at:
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    Swedish University of Agricultural Sciences
    Authors
    Sara Emery; Ola Lundin
    Time period covered
    Jan 1, 1968 - Dec 31, 2020
    Area covered
    Sweden, Skåne County
    Description

    The data contain information on the number of cabbage stem flea beetle (Psylliodes chrysocephala) larvae in winter oilseed rape plants in southern Sweden, 1968-2018. A monitoring program for cabbage stem flea beetles in southern Sweden winter oilseed rape fields started in 1969. These data were collected over a 50-year period from commercial winter oilseed rape fields across Scania, the southernmost county in Sweden, by the Swedish University of Agricultural Sciences and its predecessors as well as the Swedish Board of Agriculture. The sampling region of Scania, Sweden was divided into five subregions: (1) southeast, (2) southwest, (3) west, (4) northeast, and (5) northwest. For each subregion we also include daily maximum and minimum temperature data in Celsius from 1968-2018. The total area planted to winter oilseed rape in Scania, the mean regional number of cold days and the North Atlantic Oscillation index are reported. Lastly, P. chrysocephala density in winter oilseed rape across five subregions in the UK from 2001-2020 is extracted from a plot in a public report and reported.

    The data in the CSFB_Scania_NAs2010-11.csv file have information on crop planting date, the total number of P. chrysocephala larvae detected, the number of plants sampled, the density of larvae (total larvae/plants examined), sampling date, year, subregion, whether a seed coating or spray pesticide was used in the field and whether the sample was from a commercial field or from an experiment. 3118 rows.

    Five files have daily maximum and minimum temperatures for each subregion 1968-2018 (NorthwestInsectYear.csv, NortheastInsectYear.csv, SoutheastInsectYear.csv, SouthwestInsectYear.csv, WestInsectYear.csv). All weather data comes from Swedish weather data website https://www.smhi.se/data. 18,566 rows in the NorthwestInsectYear.csv, 18,606 rows in the NortheastInsectYear.csv, 18,574 rows in the SoutheastInsectYear.csv, 18,574 rows in the SouthwestInsectYear.csv, 18,574 rows in the WestInsectYear.csv.

    The file WOSRareaSkane1968-present.csv pulls data from the Swedish Board of Agriculture on the number of hectares planted to winter oilseed rape, turnip rape and the combined total in the Scania region in Sweden from 1968 (matching with harvest year 1969) to 2019. 53 rows.

    The file NAO_coldDays.csv gives the annual Hurrell PC-Based North Atlantic Oscillation Index value from https://climatedataguide.ucar.edu/climate-data/hurrell-north-atlantic-oscillation-nao-index-pc-based as well as the raw and log-transformed regional mean number of cold days (below -10C) per year for Scania. 51 rows.

    Finally, the data in CSFBinUK2001-2020.csv are extracted from the “historical comparisons” figure in the Crop Monitor report, last accessed 7 November 2022 at https://www.cropmonitor.co.uk/wosr/surveys/wosrPestAssLab.cfm?year=2006/2007&season=Autumn. These data include the regions in the UK that data was collected from, the harvest year and the mean number of cabbage stem flea beetle larvae counted per plant in each region and year. 101 rows.

  20. 🏆Congrats India! ICC Women's World Cup 2025 Data

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    1! (2025). 🏆Congrats India! ICC Women's World Cup 2025 Data [Dataset]. https://www.kaggle.com/datasets/ibrahimqasimi/feaers-report-id
    Explore at:
    zip(4827 bytes)Available download formats
    Dataset updated
    Nov 3, 2025
    Authors
    1!
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🏏 BREAKING: India Wins Their First Ever World Cup! 🇮🇳

    Just Updated! November 3, 2025 - The Indian Women's Cricket Team has created history by defeating South Africa by 52 runs in the final at Navi Mumbai on November 2, 2025. This comprehensive dataset captures every moment of this historic tournament!

    [Image: India celebrates World Cup victory]

    📊 Dataset Overview

    This is the most complete dataset for the ICC Women's Cricket ODI World Cup 2025 (13th edition), hosted by India and Sri Lanka from September 30 to November 2, 2025. Perfect for data analysis, machine learning, and cricket analytics!

    🔥 Why This Dataset?

    • Most Recent - Updated within 24 hours of tournament conclusion
    • 100% Complete - All 31 matches (28 group stage + 2 semi-finals + 1 final)
    • High Quality - Clean, structured CSV files ready for analysis
    • Historic Significance - First time a team outside Australia/England won
    • Multiple Dimensions - Match data, player stats, team performance, venue analysis

    📁 Files Included (7 CSV Files)

    1. match_results.csv - Complete Match Data

    Every single match with detailed information: - 31 matches from group stage to final - Match date, venue, and match type - Team scores and overs bowled - Toss details (winner and decision) - Match result and winning margin - Player of the Match awards

    Columns: match_id, date, match_type, venue, team1, team2, toss_winner, toss_decision, team1_score, team2_score, winner, margin, player_of_match

    2. points_table.csv - Final League Standings

    Official tournament standings after group stage: - Team positions (1-8) - Matches played, wins, losses, no results - Total points and Net Run Rate (NRR) - Qualification status for semi-finals

    Columns: position, team, matches, wins, losses, no_result, points, net_run_rate, qualification

    3. top_run_scorers.csv - Top 15 Batters

    Tournament's leading run scorers: - Total runs, innings, and matches played - Batting average and strike rate - Highest individual scores - Number of centuries and half-centuries

    🏆 Top Scorer: Laura Wolvaardt (South Africa) - 571 runs (New World Cup Record!)

    Columns: rank, player_name, team, matches, innings, runs, highest_score, average, strike_rate, centuries, half_centuries

    4. top_wicket_takers.csv - Top 15 Bowlers

    Tournament's leading wicket takers: - Total wickets and overs bowled - Bowling average, economy rate, and strike rate - Best bowling figures - Five-wicket hauls

    🏆 Top Wicket Taker: Deepti Sharma (India) - 22 wickets (Player of Tournament)

    Columns: rank, player_name, team, matches, innings, overs, wickets, best_figures, average, economy, strike_rate, five_wicket_hauls

    5. team_statistics.csv - Team Performance Metrics

    Comprehensive team-level statistics: - Overall win/loss records and win percentage - Highest and lowest team totals - Highest successful run chase - Average first and second innings scores

    Columns: team, matches, wins, losses, no_result, win_percentage, highest_total, lowest_total, highest_run_chase, avg_first_innings, avg_second_innings

    6. venue_statistics.csv - Venue Analysis

    Performance breakdown by venue: - Matches played at each stadium - Average scores (batting first vs chasing) - Highest team totals at each venue - Chase success rates

    Venues: Guwahati, Indore, Visakhapatnam, Navi Mumbai (India), Colombo (Sri Lanka)

    Columns: venue, city, country, matches, avg_first_innings, avg_second_innings, highest_total, chases_won, chases_lost

    7. tournament_awards_records.csv - Awards & Records

    All tournament awards and milestone achievements: - Player of the Tournament and Player of the Final - Highest run scorer and wicket taker - Individual records (highest score, best bowling) - Team records (highest total, biggest win) - Tournament milestones and achievements

    Columns: award_category, recipient, team, performance
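    As a quick start with the files listed above, here is a minimal sketch using the match_results.csv columns from item 1:

```python
# Minimal sketch: load the match results and summarise wins and venues.
import pandas as pd

matches = pd.read_csv("match_results.csv", parse_dates=["date"])
print(matches["winner"].value_counts())              # wins per team across all 31 matches
print(matches.groupby("venue")["match_id"].count())  # matches played at each venue
```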

    🏆 Tournament Highlights

    🎯 Final Result

    India 298/7 (50 overs) defeated South Africa 246 (45.3 overs) by 52 runs

    • Date: November 2, 2025
    • Venue: Dr. DY Patil Sports Academy, Navi Mumbai
    • Player of the Match: Shafali Verma (87 runs & 2/36 bowling)

    Key Performances: - 🏏 Shafali Verma: 87 off 78 balls + 2 wickets - 🎯 Deepti Sharma: 58 runs + 5/39 bowling - 💯 Laura Wolvaardt: 101 off 97 balls (SA captain)

    📈 Tournament Records

    • Most Runs: Laura Wolvaardt (SA) - 571 runs ⭐ New World Cup Record
    • Most Wickets: Deepti Sharma (IND) - 22 wickets
    • Highest Individual Score: Laura Wolvaardt - 169 vs England (Semi-Final)
    • Best Bowling Figures: Alana King (AUS) - 7/20 vs South Africa
    • Highest Team Total: India - 333/5 vs Australia (S...