100+ datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Netflix Movies and TV Shows Dataset Cleaned(excel)

    • kaggle.com
    Updated Apr 8, 2025
    Cite
    Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Tawri
    Description

    This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

    🎯 What’s Included:

    • Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
    • A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.)
    • Columns like 'date_added' have been properly formatted into DMY structure
    • Multi-valued columns like 'listed_in' are split for better analysis
    • Null values replaced with “Unknown” for clarity
    • Duration field broken into numeric + unit components
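    The dataset itself was cleaned in Excel, not in code. For readers who prefer a script, the steps listed above can be roughly sketched in Python. This is only an illustration under assumptions: the function names, the source date format ("September 25, 2021"), the "DD-MM-YYYY" output, and the sample values are hypothetical and are not taken from the author's formulas file.

```python
from datetime import datetime

PLACEHOLDER = "Unknown"  # the description says nulls were replaced with "Unknown"


def clean_text(value):
    """Trim and collapse whitespace (similar in spirit to Excel's TRIM)."""
    if value is None:
        return PLACEHOLDER
    cleaned = " ".join(str(value).split())
    return cleaned if cleaned else PLACEHOLDER


def format_date_dmy(value, source_format="%B %d, %Y"):
    """Reformat a date string into day-month-year order (assumed input format)."""
    cleaned = clean_text(value)
    if cleaned == PLACEHOLDER:
        return PLACEHOLDER
    return datetime.strptime(cleaned, source_format).strftime("%d-%m-%Y")


def split_multi(value, sep=","):
    """Split a multi-valued cell such as 'listed_in' into separate items."""
    if value is None:
        return []
    return [clean_text(part) for part in str(value).split(sep) if part.strip()]


def split_duration(value):
    """Break a duration such as '90 min' into (number, unit)."""
    number, _, unit = clean_text(value).partition(" ")
    if not number.isdigit():
        return None, PLACEHOLDER
    return int(number), unit


if __name__ == "__main__":
    print(format_date_dmy(" September 25, 2021"))
    print(split_multi("Dramas, International Movies"))
    print(split_duration("2 Seasons"))
```

    Keeping each step in its own small function mirrors the one-formula-per-column approach of a spreadsheet and makes each rule easy to check in isolation.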

    🔍 Dataset Purpose: Ideal for beginners and analysts who want to:

    • Practice data cleaning in Excel
    • Explore Netflix content trends
    • Analyze content by type, country, genre, or date added

    📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

    📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.

  3. Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Dataset updated
    Jul 16, 2020
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  4. Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop

    • search.dataone.org
    • borealisdata.ca
    Updated Jul 31, 2024
    Cite
    Costanzo, Lucia; Jadon, Vivek (2024). Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop [Dataset]. http://doi.org/10.5683/SP3/FF6AI9
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Borealis
    Authors
    Costanzo, Lucia; Jadon, Vivek
    Description

    Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.

  5. Dirty Excel Data

    • kaggle.com
    zip
    Updated Feb 23, 2022
    Cite
    Shiva Vashishtha (2022). Dirty Excel Data [Dataset]. https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data/code
    Available download formats: zip (13123 bytes)
    Dataset updated
    Feb 23, 2022
    Authors
    Shiva Vashishtha
    Description

    Dataset

    This dataset was created by Shiva Vashishtha


  6. Project 2:Excel data cleaning & dashboard creation

    • kaggle.com
    zip
    Updated Jun 30, 2024
    Cite
    George M122 (2024). Project 2:Excel data cleaning & dashboard creation [Dataset]. https://www.kaggle.com/datasets/georgem122/project-2excel-data-cleaning-and-dashboard-creation
    Available download formats: zip (185070 bytes)
    Dataset updated
    Jun 30, 2024
    Authors
    George M122
    Description

    Dataset

    This dataset was created by George M122


  7. Data Cleaning Excel Tutorial

    • kaggle.com
    zip
    Updated Jul 22, 2023
    Cite
    Mohamed Khaled Idris (2023). Data Cleaning Excel Tutorial [Dataset]. https://www.kaggle.com/datasets/mohamedkhaledidris/data-cleaning-excel-tutorial
    Available download formats: zip (13023 bytes)
    Dataset updated
    Jul 22, 2023
    Authors
    Mohamed Khaled Idris
    Description

    Dataset

    This dataset was created by Mohamed Khaled Idris


  8. Household Income and Expenditure 2010 - Tuvalu

    • catalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    Central Statistics Division (2019). Household Income and Expenditure 2010 - Tuvalu [Dataset]. http://catalog.ihsn.org/catalog/3203
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Central Statistics Division
    Time period covered
    2010
    Area covered
    Tuvalu
    Description

    Abstract

    The main objectives of the survey were:

    • To obtain weights for the revision of the Consumer Price Index (CPI) for Funafuti;
    • To provide information on the nature and distribution of household income, expenditure and food consumption patterns;
    • To provide data on the household sector's contribution to the National Accounts;
    • To provide information on the economic activity of men and women to study gender issues;
    • To undertake some poverty analysis.

    Geographic coverage

    National, including Funafuti and Outer islands

    Analysis unit

    • Household
    • individual

    Universe

    All private households are included in the sampling frame. In each selected household, the current residents are surveyed, as are people who are usual residents but are currently away (for work, health, or holiday reasons, or as boarding students, for example). If the household had been residing in Tuvalu for less than one year:

    • but intends to reside for more than 12 months => the household is included
    • and does not intend to reside for more than 12 months => out of scope

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    It was decided that a 33% (one third) sample was sufficient to achieve suitable levels of accuracy for key estimates in the survey. The sample selection was therefore spread proportionally across all the islands except Niulakita, which was considered too small. For selection purposes, each island was treated as a separate stratum and independent samples were selected from each. The strategy used was to list each dwelling on the island by its geographical position and run a systematic skip through the list to achieve the 33% sample. This approach ensured that the sample would be spread out across each island as much as possible and would thus be more representative.

    For details please refer to Table 1.1 of the Report.
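    The systematic skip described above can be illustrated with a short Python sketch. It assumes a skip interval of 3 to approximate the one-third sample and a random starting position within the first interval; the function names, island names, and dwelling identifiers are hypothetical, and the survey report may have handled the start point differently.

```python
import random


def systematic_sample(dwellings, skip=3, start=0):
    """Take every `skip`-th dwelling from a geographically ordered list."""
    if skip < 1 or not 0 <= start < skip:
        raise ValueError("skip must be >= 1 and start must satisfy 0 <= start < skip")
    return dwellings[start::skip]


def sample_by_stratum(dwellings_by_island, skip=3, seed=None):
    """Draw an independent systematic sample within each island (stratum)."""
    rng = random.Random(seed)
    return {
        island: systematic_sample(ordered, skip, start=rng.randrange(skip))
        for island, ordered in dwellings_by_island.items()
    }


if __name__ == "__main__":
    # Hypothetical frame: dwellings already listed in geographical order.
    frame = {
        "Island A": [f"A-{i:03d}" for i in range(30)],
        "Island B": [f"B-{i:03d}" for i in range(12)],
    }
    for island, chosen in sample_by_stratum(frame, seed=1).items():
        print(island, len(chosen), chosen[:3])
```

    Because the list is ordered by position, a fixed skip spreads the selected dwellings evenly over the island, which is the property the procedure relies on.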

    Sampling deviation

    Only the island of Niulakita was excluded from the sampling frame, as it was considered too small.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    There were three main survey forms used to collect data for the survey. Each question was written in English and translated into Tuvaluan on the same version of the questionnaire. The questionnaires were designed based on the 2004 survey questionnaire.

    HOUSEHOLD FORM

    • composition of the household and demographic profile of each member
    • dwelling information
    • dwelling expenditure
    • transport expenditure
    • education expenditure
    • health expenditure
    • land and property expenditure
    • household furnishing
    • home appliances
    • cultural and social payments
    • holidays/travel costs
    • loans and saving
    • clothing
    • other major expenditure items

    INDIVIDUAL FORM

    • health and education
    • labor force (individuals aged 15 and above)
    • employment activity and income (individuals aged 15 and above): wages and salaries, working in own business, agriculture and livestock, fishing, income from handicraft, income from gambling, small-scale activities, jobs in the last 12 months, other income, children's income, tobacco and alcohol use, other activities, and seafarer

    DIARY (one diary per week over a 2-week period; 2 diaries per household were required)

    • All kinds of expenses
    • Home production
    • Food and drink (eaten by the household, given away, sold)
    • Goods taken from own business (consumed, given away)
    • Monetary gifts (given away, received, winnings from gambling)
    • Non-monetary gifts (given away, received, winnings from gambling)

    Questionnaire Design Flaws

    Questionnaire design flaws are problems with the way questions were worded that result in the respondent providing an incorrect answer. Despite every effort to minimize this problem during the design of the respective survey questionnaires and the diaries, problems were still identified during the analysis of the data. Some examples are provided below:

    Gifts, Remittances & Donations

    Collecting information on the following is a very difficult task in a HIES:

    • the receipt and provision of gifts
    • the receipt and provision of remittances
    • the provision of donations to the church, other communities and family occasions

    The extent of these activities in Tuvalu is very high, so every effort should be made to address these activities as best as possible. A key problem lies in identifying the best form (questionnaire or diary) for covering such activities. A general rule of thumb for a HIES is that if the activity occurs on a regular basis, and involves the exchange of small monetary amounts or in-kind gifts, the diary is more appropriate. On the other hand, if the activity is less frequent, and involves larger sums of money, the questionnaire with a recall approach is preferred. It is not always easy to distinguish between the two for the different activities, and as such, both the diary and questionnaire were used to collect this information. Unfortunately, it probably wasn't made clear enough which types of transactions were being collected from the different sources, and as such some transactions might have been missed, and others counted twice. The effects of these problems are hopefully minimal overall.

    Defining Remittances

    Because people have different interpretations of what constitutes remittances, the questionnaire needs to be very clear as to how this concept is defined in the survey. Unfortunately, this wasn't explained clearly enough, so it was difficult to distinguish between a remittance, which should be of a more regular nature, and a one-off monetary gift which was transferred between two households.

    Business Expenses Still Recorded

    The aim of the survey is to measure "household" expenditure, and as such, any expenditure made by a household for an item or service which was primarily used for a business activity should be excluded. It was not always clear in the questionnaire that this was the case, and as such some business expenses were included. Efforts were made during data cleaning to remove any such business expenses which would impact significantly on survey results.

    Purchased goods given away as a gift

    When a household makes a gift donation of an item it has purchased, this is recorded in section 5 of the diary. Unfortunately, it was difficult to know how to treat these items, as it was not clear whether the item had already been recorded in section 1 of the diary, which covers purchases. The decision was made to exclude all information on gifts given which were considered to be purchases, as these items were assumed to have already been recorded in section 1. Ideally these items should be treated as a purchased gift given away, which in turn is not household consumption expenditure, but this was not possible.

    Some key items missed in the Questionnaire

    Although not a big issue, some key expenditure items were omitted from the questionnaire when it would have been best to collect them via this schedule. A key example is electric fans, which many households in Tuvalu own.

    Cleaning operations

    Consistency of the data:

    • each questionnaire was checked by the supervisor during and after the collection
    • before data entry, all the questionnaires were coded
    • the CSPro data entry system included inconsistency checks, which allowed the NSO staff to identify some errors and correct them with imputed estimates based on their own knowledge (there was no time for double entry); there were 4 data entry operators
    • after data entry, outliers were identified in order to check their consistency

    All data entry, including editing, edit checks and queries, was done using CSPro (Census Survey Processing System) with additional data editing and cleaning taking place in Excel.

    The staff from the CSD was responsible for undertaking the coding and data entry, with assistance from an additional four temporary staff to help produce results in a more timely manner.

    Although enumeration was not completed until mid-June, the coding and data entry commenced as soon as forms were available from Funafuti, which was towards the end of March. The coding and data entry was then completed around the middle of July.

    A visit from an SPC consultant then took place to undertake initial cleaning of the data, primarily addressing missing data items and missing schedules. Once the initial data cleaning was undertaken in CSPro, data was transferred to Excel where it was closely scrutinized to check that all responses were sensible. In the cases where unusual values were identified, original forms were consulted for these households and modifications made to the data if required.

    Despite the best efforts being made to clean the data file in preparation for the analysis, no doubt errors will still exist in the data, due to its size and complexity. Having said this, they are not expected to have significant impacts on the survey results.
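    The survey documentation does not say how unusual values were identified in Excel. As one possible illustration only, the Python sketch below flags values outside Tukey's interquartile-range fences so that the corresponding original forms can be pulled for review. The 1.5 multiplier, the function name, and the sample figures are assumptions, not part of the survey's procedure.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the positions of values outside the IQR fences (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [i for i, v in enumerate(values) if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical weekly expenditure amounts for one diary item.
    weekly_spend = [10, 12, 11, 13, 12, 11, 500]
    for position in flag_outliers(weekly_spend):
        print("check original form for record", position, weekly_spend[position])
```

    Flagging rather than deleting matches the workflow described above, where the original forms were consulted before any value was modified.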

    Under-Reporting and Incorrect Reporting as a result of Poor Field Work Procedures

    The most crucial stage of any survey activity, whether it be a population census or a survey such as a HIES, is the fieldwork. It is crucial for intense checking to take place in the field before survey forms are returned to the office for data processing. Unfortunately, it became evident during the cleaning of the data that fieldwork wasn't checked as thoroughly as required, and as such some unexpected values appeared in the questionnaires, as well as unusual results appearing in the diaries. Efforts were made to identify the main issues which would have the greatest impact on final results, and this information was modified to a more reasonable answer, using local knowledge, when required.

    Data Entry Errors

    Data entry errors are always expected, but can be kept to a minimum with

  9. Data from: Advanced Excel

    • theskilldeck.com
    Updated Nov 5, 2025
    Cite
    (2025). Advanced Excel [Dataset]. https://theskilldeck.com/blog/
    Dataset updated
    Nov 5, 2025
    Description

    Managing, cleaning, and analyzing HR data efficiently. Offers flexible, accessible data handling for quick HR analysis.

  10. Cleaned-Data Pakistan's Largest Ecommerce Dataset

    • kaggle.com
    Updated Mar 25, 2023
    Cite
    umaraziz97 (2023). Cleaned-Data Pakistan's Largest Ecommerce Dataset [Dataset]. https://www.kaggle.com/datasets/umaraziz97/cleaned-data-pakistans-largest-ecommerce-dataset
    Dataset updated
    Mar 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    umaraziz97
    Area covered
    Pakistan
    Description

    Pakistan’s largest ecommerce data – Power BI Report

    Dataset Link: pakistan’s_largest_ecommerce_dataset
    Cleaned Data: Cleaned_Pakistan’s_largest_ecommerce_dataset

    Raw Data:

    Rows: 584525
    Columns: 21

    Process:

    All the raw data was transformed and saved in a new Excel file: Working – Pakistan Largest Ecommerce Dataset

    Processed Data:

    Rows: 582250
    Columns: 22

    Visualization:

    Link to the visualization report: Pakistan-s-largest-ecommerce-data-Power-BI-Data-Visualization-Report

    Conclusion:

    Among categories, Mobiles & Tablets makes the most money by selling the highest number of products while also offering the highest amount of discount. Men’s Fashion, on the other hand, sells the second-highest number of products but does not generate revenue in the same proportion; the prices of individual products may be the reason. In order details, Mobiles & Tablets has the highest number of canceled orders, while its completed orders are almost the same as Men’s Fashion. Most orders are completed, but there is a large number of canceled orders. Among payment methods, cod has the most completed orders, and most canceled orders used the Easyaxis payment method.

  11. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Available download formats: txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

    • demographics (281 variables),
    • dietary consumption (324 variables),
    • physiological functions (1,040 variables),
    • occupation (61 variables),
    • questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
    • medications (29 variables),
    • mortality information linked from the National Death Index (15 variables),
    • survey weights (857 variables),
    • environmental exposure biomarker measurements (598 variables), and
    • chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.

    • "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    • "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    • “dictionary_drug_codes.csv” contains the dictionary for descriptors on the drug codes.
    • “nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.

    • “w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
    • “m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

    Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.

    • “example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.
    • “example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    • “example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
    • “example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.

  12. Spreadsheet Processing Capabilities

    • nantucketai.com
    csv, xlsx
    Updated Sep 12, 2025
    Cite
    Anthropic (2025). Spreadsheet Processing Capabilities [Dataset]. https://www.nantucketai.com/claude-just-changed-how-we-do-spreadsheets-with-its-new-feature/
    Available download formats: csv, xlsx
    Dataset updated
    Sep 12, 2025
    Dataset authored and provided by
    Anthropic
    Description

    Types of data processing Claude's Code Interpreter can handle

  13. Bike Buyers - Excel Project

    • kaggle.com
    zip
    Updated Jun 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ann Truong (2023). Bike Buyers - Excel Project [Dataset]. https://www.kaggle.com/datasets/bvanntruong/bike-buyers
    Available download formats: zip (11866 bytes)
    Dataset updated
    Jun 8, 2023
    Authors
    Ann Truong
    Description

    This dataset illustrates customer data from bike sales. It contains information such as Income, Occupation, Age, Commute, Gender, Children, and more. This is fictional data, created and used for data exploration and cleaning.

    The link for the Excel project to download can be found on GitHub here. It includes the raw data, the cleaned data, Pivot Tables, and a dashboard with Pivot Charts and Slicers for interaction. This allows the interactive dashboard to filter by Marital Status, Region, and Education.

    Below is a screenshot of the dashboard for ease. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12904052%2Fcbc9db6fe00f3201c64e4fdb668ce9d1%2FBikeBuyers%20Dashboard%20Image.png?generation=1686186378985936&alt=media

  14. Global Household Cleaning Products Market Size, Share, Growth Analysis, By...

    • skyquestt.com
    Updated Apr 17, 2024
    Cite
    SkyQuest Technology (2024). Global Household Cleaning Products Market Size, Share, Growth Analysis, By Product(Dishwashing Products, Surface Cleaners), By Distribution Channel(Convenience Stores, Supermarkets/Hypermarkets) - Industry Forecast 2023-2030 [Dataset]. https://www.skyquestt.com/report/household-cleaning-products-market
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    SkyQuest Technology
    License

    https://www.skyquestt.com/privacy/

    Time period covered
    2023 - 2030
    Area covered
    Global
    Description

    Global Household Cleaning Products Market size was valued at USD 235.76 billion in 2021 and is poised to grow from USD 246.13 billion in 2022 to USD 362.64 billion by 2030, growing at a CAGR of 4.4% in the forecast period (2023-2030).
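    A compound annual growth rate (CAGR) links a start value, an end value, and a number of years. The Python sketch below shows the standard formula applied to the figures quoted above; the function names are hypothetical. The rate implied by the 2022 and 2030 figures need not equal the stated 4.4%, since the report may use a different base year or rounding.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` for `years` periods."""
    return start_value * (1 + rate) ** years


if __name__ == "__main__":
    # Figures quoted in the description, in USD billions.
    implied = cagr(246.13, 362.64, years=2030 - 2022)
    print(f"implied CAGR 2022-2030: {implied:.2%}")
    print(f"2022 value grown at 4.4% for 8 years: {project(246.13, 0.044, 8):.2f}")
```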

  15. All Chhattisgarh Dry Cleaning Shop Database – Verified & Updated Contact...

    • bulkdataprovider.com
    Updated May 7, 2025
    Cite
    Bulk Data Provider (2025). All Chhattisgarh Dry Cleaning Shop Database – Verified & Updated Contact Directory in Excel [Dataset]. https://www.bulkdataprovider.com/items/all-chhattisgarh-dry-cleaning-shop-database-verified-updated-contact-directory-in-excel/1669
    Dataset updated
    May 7, 2025
    Dataset authored and provided by
    Bulk Data Provider
    Variables measured
    Record count
    Description

    🧾 All Chhattisgarh Dry Cleaning Shop Database – Verified & Updated Contact Directory in Excel

    The All Chhattisgarh Dry Cleaning Shop Database is a detailed, verified, and regularly updated Excel directory of professional dry cleaners and laundry service providers across Chhattisgarh. This powerf...

  16. Data from: Data cleaning and enrichment through data integration: networking...

    • search.dataone.org
    • data.niaid.nih.gov
    Updated Feb 25, 2025
    Cite
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. http://doi.org/10.5061/dryad.wpzgmsbwj
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:

    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets)
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset and authors with no match (e.g., because of not being part of an Italian university) have been discarded. The remaining authors will compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

    Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Manuscript published in Scientific Data with DOI .

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format.

    along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):

    • University-City-match.xlsx, an Excel file that maps the name of a university against the city where its respective headquarter is located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
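    To make the idea of a co-authorship edge list concrete, the Python sketch below derives weighted edges from a mapping of papers to author identifiers. It is a simplified illustration, not the authors' pipeline: the function name, the input structure, and the sample identifiers are hypothetical, and the real edge_data_AGG.csv carries additional temporal attributes that are not modelled here.

```python
from collections import Counter
from itertools import combinations


def coauthorship_edges(papers):
    """Count shared papers for every unordered pair of co-authors.

    `papers` maps a paper id to the list of its (already matched) author ids.
    """
    edges = Counter()
    for authors in papers.values():
        # set() guards against an author listed twice on the same paper;
        # sorted() gives each pair one canonical orientation.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    sample = {
        "paper-1": ["author-A", "author-B", "author-C"],
        "paper-2": ["author-B", "author-A"],
        "paper-3": ["author-D"],  # single-author paper: contributes no edge
    }
    for (a, b), weight in sorted(coauthorship_edges(sample).items()):
        print(a, b, weight)
```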

    Description of the main data files

    The `Coauthorship_Networ...

  17. Dataset – Student & Early-Career Survey on Data-Analytics Tool Adoption and...

    • figshare.com
    xlsx
    Updated Jun 29, 2025
    Cite
    Lev Radman (2025). Dataset – Student & Early-Career Survey on Data-Analytics Tool Adoption and Decision-Making (Uzbekistan, Apr–May 2025) [Dataset]. http://doi.org/10.6084/m9.figshare.29430227.v1
    Available download formats: xlsx
    Dataset updated
    Jun 29, 2025
    Dataset provided by
    figshare
    Authors
    Lev Radman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose. This dataset contains anonymised raw responses (n = 55, 31 variables) from a cross-sectional survey investigating factors that influence the adoption of data-analytics tools (Excel/Sheets, Power BI/Tableau, Python notebooks, Google Analytics) among graduate students and early-career professionals in Uzbekistan.

    Instrument. Items operationalise seven UTAUT/TAM-based constructs: Performance Expectancy, Effort Expectancy, Behavioural Intention, Familiarity & Usage, Task–Technology Fit, Barriers to Adoption, plus Demographics (age, gender, study programme, prior stats courses, work experience). All Likert items use a five-point scale.

    Collection & cleaning. Data were collected via Google Forms between 02 Apr 2025 and 22 Apr 2025 through university e-mail lists, Telegram study channels, and LinkedIn posts. Five partial records (> 20 % missing) were removed; remaining open-text answers were lower-cased, spell-checked, and stemmed. The file is provided exactly as analysed in the accompanying thesis; no further processing (e.g., recoding) has been performed.

    File contents. survey_responses.xlsx – one worksheet (“Form Responses 1”) with 55 rows × 31 columns. Column A (“Timestamp”) shows submission time in UTC+5. Variable names follow the original question stems for transparency.

    Ethics & privacy. All participants gave informed e-consent; no personal identifiers (names, e-mails, IPs) are included. Ethical approval: Silk Road University REC # 2025-DX-012.
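    The rule for removing partial records (more than 20 % missing) can be sketched in Python as follows. This is an illustration under assumptions: records are modelled as dictionaries, empty strings and None count as missing, and the function names and sample rows are hypothetical. Spell-checking and stemming are not shown.

```python
def missing_fraction(record):
    """Share of fields in a response that are empty or None."""
    values = list(record.values())
    if not values:
        return 1.0
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    return missing / len(values)


def drop_partial_records(records, threshold=0.20):
    """Keep responses whose missing share does not exceed the threshold."""
    return [r for r in records if missing_fraction(r) <= threshold]


def normalize_open_text(text):
    """Lower-case an open-text answer and collapse stray whitespace."""
    return " ".join(text.lower().split())


if __name__ == "__main__":
    # Hypothetical responses with five fields each.
    responses = [
        {"q1": 4, "q2": 5, "q3": 3, "q4": 2, "q5": "Use Excel daily"},
        {"q1": 4, "q2": "", "q3": 3, "q4": 2, "q5": "Power BI"},
        {"q1": 4, "q2": "", "q3": None, "q4": 2, "q5": ""},
    ]
    kept = drop_partial_records(responses)
    print(len(kept), [normalize_open_text(r["q5"]) for r in kept])
```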

  18. S

    EFFICIENCY ANALYSIS OF SANITATION MANAGEMENT BASED ON PARALLEL TWO-STAGE DEA...

    • scidb.cn
    Updated Jun 20, 2023
    Cite
    zhan jing jing; Fu Yingxiong; Zhang Xin; Chen Weiguo; Xie Qiwei (2023). EFFICIENCY ANALYSIS OF SANITATION MANAGEMENT BASED ON PARALLEL TWO-STAGE DEA [Dataset]. http://doi.org/10.57760/sciencedb.j00206.00004
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Science Data Bank
    Authors
    zhan jing jing; Fu Yingxiong; Zhang Xin; Chen Weiguo; Xie Qiwei
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset covers 30 provinces (municipalities) in China, excluding Xizang, Hong Kong, Macao and Taiwan, over the ten years from 2011 to 2020, and was used with the network interactive DEA model to evaluate the efficiency of sanitation management. The input data are stored in an Excel spreadsheet named "Data". The indicators include population density (people/square kilometer), urban area (square kilometers), harmless treatment capacity (tons/day), investment in city appearance and environmental sanitation (100 million yuan), employment (10,000 people), harmless treatment plants (seats), special vehicles and equipment for city appearance and environmental sanitation (vehicles), the number of public toilets (seats), harmless treatment capacity (10,000 tons), domestic waste clearance volume (10,000 tons), and road cleaning area (10,000 square meters). The dataset also contains the influencing factors used in the spatial econometric analysis of efficiency, saved in the Excel table named "Efficiency Influencing Factors". These indicators include the level of urban spatial aggregation, the level of urban economic development, the level of urban opening to the outside world, the urban environmental carrying capacity, the level of urban scientific and technological development, the urban industrial structure, and the level of urban ecological construction. In addition, the dataset contains the efficiency values obtained from the analysis (Excel table named "Efficiency") and the relevant network DEA code.

  19. d

    Data from: Functional morphology and efficiency of the antenna cleaner in...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jun 26, 2015
    Cite
    Alexander Hackmann; Henry Delacave; Adam Robinson; David Labonte; Walter Federle (2015). Functional morphology and efficiency of the antenna cleaner in Camponotus rufifemur ants [Dataset]. http://doi.org/10.5061/dryad.88q18
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 26, 2015
    Dataset provided by
    Dryad
    Authors
    Alexander Hackmann; Henry Delacave; Adam Robinson; David Labonte; Walter Federle
    Time period covered
    Jun 25, 2015
    Area covered
    Cambridge, UK
    Description

    Data for manuscript “Functional morphology and efficiency of the antenna cleaner in Camponotus rufifemur ants”. Excel file includes 3 data sheets, one sheet for each experiment. The corresponding figures from the manuscript are mentioned above the actual data.

    Manuscript data.xlsx

  20. r

    Respiration_chambers/raw_log_files and combined datasets of biomass and...

    • researchdata.edu.au
    Updated Jun 23, 2025
    + more versions
    Cite
    Australian Ocean Data Network (2025). Respiration_chambers/raw_log_files and combined datasets of biomass and chamber data, and physical parameters [Dataset]. https://researchdata.edu.au/respirationchambersrawlogfiles-combined-datasets-physical-parameters/3718192
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    data.gov.au
    Authors
    Australian Ocean Data Network
    Area covered
    Description

    General overview

    The following datasets are described by this metadata record, and are available for download from the provided URL.

    Raw log files, physical parameters raw log files
    Raw excel files, respiration/PAM chamber raw excel spreadsheets
    Processed and cleaned excel files, respiration chamber biomass data
    Raw rapid light curve excel files (this is duplicated from Raw log files), combined dataset pH, temperature, oxygen, salinity, velocity for experiment
    Associated R script file for pump cycles of respiration chambers

    Physical parameters raw log files

    Raw log files
    1) DATE=
    2) Time= UTC+11
    3) PROG=Automated program to control sensors and collect data
    4) BAT=Amount of battery remaining
    5) STEP=check aquation manual
    6) SPIES=check aquation manual
    7) PAR=Photoactive radiation
    8) Levels=check aquation manual
    9) Pumps= program for pumps
    10) WQM=check aquation manual

    Respiration/PAM chamber raw excel spreadsheets

    Abbreviations in headers of datasets

    Note: Two data sets are provided in different formats: raw and cleaned (adj). These are the same data with the PAR column moved over to PAR.all for analysis. All headers are the same. The cleaned (adj) dataframe will work with the R syntax below; alternatively, add code to do the cleaning in R.

    Date: ISO 1986 - Check
    Time: UTC+11 unless otherwise stated
    DATETIME: UTC+11 unless otherwise stated
    ID (of instrument in respiration chambers)
    ID43=Pulse amplitude fluorescence measurement of control
    ID44=Pulse amplitude fluorescence measurement of acidified chamber
    ID=1 Dissolved oxygen
    ID=2 Dissolved oxygen
    ID3= PAR
    ID4= PAR
    PAR=Photo active radiation umols
    F0=Minimal fluorescence from PAM
    Fm=Maximum fluorescence from PAM
    Yield=(F0 – Fm)/Fm
    rChl=an estimate of chlorophyll (Note this is uncalibrated and is an estimate only)
    Temp=Temperature degrees C
    PAR=Photo active radiation
    PAR2= Photo active radiation2
    DO=Dissolved oxygen
    %Sat= Saturation of dissolved oxygen
    Notes=This is the program of the underwater submersible logger with the following abbreviations:
    Notes-1) PAM=
    Notes-2) PAM=Gain level set (see aquation manual for more detail)
    Notes-3) Acclimatisation= Program of slowly introducing treatment water into chamber
    Notes-4) Shutter start up 2 sensors+sample…= Shutter PAMs automatic set up procedure (see aquation manual)
    Notes-5) Yield step 2=PAM yield measurement and calculation of control
    Notes-6) Yield step 5= PAM yield measurement and calculation of acidified
    Notes-7) Abatus respiration DO and PAR step 1= Program to measure dissolved oxygen and PAR (see aquation manual). Steps 1-4 are different stages of this program including pump cycles, DO and PAR measurements.
    8) Rapid light curve data
    Pre LC: A yield measurement prior to the following measurement
    After 10.0 sec at 0.5% to 8%: Level of each of the 8 steps of the rapid light curve
    Odessey PAR (only in some deployments): An extra measure of PAR (umols) using an Odessey data logger
    Dataflow PAR: An extra measure of PAR (umols) using a Dataflow sensor
    PAM PAR: This is copied from the PAR or PAR2 column
    PAR all: This is the complete PAR file and should be used
    Deployment: Identifying which deployment the data came from

    Respiration chamber biomass data

    The data are chlorophyll a biomass from cores from the respiration chambers. The headers are:
    Depth (mm)
    Treat (Acidified or control)
    Chl a (pigment and indicator of biomass)
    Core (5 cores were collected from each chamber, three were analysed for chl a); these are pseudoreplicates/subsamples from the chambers and should not be treated as replicates.

    Associated R script file for pump cycles of respiration chambers

    Associated respiration chamber data to determine the times when respiration chamber pumps delivered treatment water to chambers, determined from Aquation log files (see associated files). Use the chamber cut times to determine net production rates. Note: Users need to avoid the times when the respiration chambers are delivering water, as this will give incorrect results. The headers that get used in the attached/associated R file are start regression and end regression. The remaining headers are not used unless called for in the associated R script. The last columns of these datasets (intercept, ElapsedTimeMincoef) are determined from the linear regressions described below.
    To determine the rate of change of net production, coefficients of the regression of oxygen consumption in discrete 180 minute data blocks were determined. R squared values for fitted regressions of these coefficients were consistently high (greater than 0.9). We make two assumptions in calculating net production rates: the first is that heterotrophic community members do not change their metabolism under OA; and the second is that the heterotrophic communities are similar between treatments.
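
    As a rough illustration of the block-wise regression described above, the sketch below fits an ordinary least-squares line to dissolved-oxygen readings within consecutive 180-minute blocks and returns an intercept and a slope per block, loosely mirroring the intercept and ElapsedTimeMincoef columns. It is not the associated R script: the function names and the synthetic readings are hypothetical, and the sketch assumes fixed blocks starting at minute zero rather than using the start regression and end regression times, which the record says are needed to exclude pump-delivery periods.

```python
from typing import Dict, List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def block_rates(
    elapsed_min: Sequence[float],
    oxygen: Sequence[float],
    block_len: float = 180.0,
) -> List[Tuple[float, float, float]]:
    """Return (block start, intercept, slope) for each block of block_len minutes."""
    blocks: Dict[int, List[Tuple[float, float]]] = {}
    for t, o in zip(elapsed_min, oxygen):
        blocks.setdefault(int(t // block_len), []).append((t, o))

    results = []
    for idx in sorted(blocks):
        points = blocks[idx]
        times = [p[0] for p in points]
        values = [p[1] for p in points]
        if len(set(times)) < 2:
            continue  # a slope needs at least two distinct time points
        intercept, slope = fit_line(times, values)
        results.append((idx * block_len, intercept, slope))
    return results


# Synthetic readings every 30 minutes (hypothetical values, not from the dataset).
elapsed = list(range(0, 360, 30))
oxygen = [100.0 - 0.05 * t for t in elapsed[:6]] + [
    91.0 - 0.02 * (t - 180) for t in elapsed[6:]
]

rates = block_rates(elapsed, oxygen)
```

    Each slope is the change in oxygen per minute within its block; in practice the readings would first be filtered to the regression windows given in the dataset.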

    Combined dataset pH, temperature, oxygen, salinity, velocity for experiment

    This data is rapid light curve data generated from a Shutter PAM fluorimeter. There are eight steps in each rapid light curve. Note: The software component of the Shutter PAM fluorimeter for sensor 44 appeared to be damaged and would not cycle through the PAR cycles. Therefore the rapid light curves and recovery curves should only be used for the control chambers (sensor ID43).
    The headers are:
    PAR: Photoactive radiation
    relETR: F0/Fm x PAR
    Notes: Stage/step of light curve
    Treatment: Acidified or control

    The associated light treatments in each stage. Each actinic light intensity is held for 10 seconds, then a saturating pulse is taken (see PAM methods).
    After 10.0 sec at 0.5% = 1 umols PAR
    After 10.0 sec at 0.7% = 1 umols PAR
    After 10.0 sec at 1.1% = 0.96 umols PAR
    After 10.0 sec at 1.6% = 4.32 umols PAR
    After 10.0 sec at 2.4% = 4.32 umols PAR
    After 10.0 sec at 3.6% = 8.31 umols PAR
    After 10.0 sec at 5.3% = 15.78 umols PAR
    After 10.0 sec at 8.0% = 25.75 umols PAR

    This dataset appears to be missing data; note that D5 rows are potentially not useable information. See the word document in the download file for more information.

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
167 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License

CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.
