72 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    167 scholarly articles cite this dataset
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Netflix Movies and TV Shows Dataset Cleaned(excel)

    • kaggle.com
    Updated Apr 8, 2025
    Cite
    Gaurav Tawri (2025). Netflix Movies and TV Shows Dataset Cleaned(excel) [Dataset]. https://www.kaggle.com/datasets/gauravtawri/netflix-movies-and-tv-shows-dataset-cleanedexcel
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Tawri
    Description

    This dataset is a cleaned and preprocessed version of the original Netflix Movies and TV Shows dataset available on Kaggle. All cleaning was done using Microsoft Excel — no programming involved.

    🎯 What’s Included:
    • Cleaned Excel file (standardized columns, proper date format, removed duplicates/missing values)
    • A separate "formulas_used.txt" file listing all Excel formulas used during cleaning (e.g., TRIM, CLEAN, DATE, SUBSTITUTE, TEXTJOIN, etc.)
    • Columns like 'date_added' have been properly formatted into DMY structure
    • Multi-valued columns like 'listed_in' are split for better analysis
    • Null values replaced with “Unknown” for clarity
    • Duration field broken into numeric + unit components

    🔍 Dataset Purpose: Ideal for beginners and analysts who want to:
    • Practice data cleaning in Excel
    • Explore Netflix content trends
    • Analyze content by type, country, genre, or date added

    📁 Original Dataset Credit: The base version was originally published by Shivam Bansal on Kaggle: https://www.kaggle.com/shivamb/netflix-shows

    📌 Bonus: You can find a step-by-step cleaning guide and the same dataset on GitHub as well — along with screenshots and formulas documentation.
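
    For anyone who wants to replicate the described Excel steps programmatically, here is a minimal pandas sketch of the same cleaning (trim whitespace, parse date_added, drop duplicates, split duration, fill nulls with "Unknown"). It is illustrative only; the file and column names follow the original Kaggle dataset, not this Excel workbook.

    import pandas as pd

    # Load the original Kaggle listing (netflix_titles.csv is the usual file name).
    df = pd.read_csv("netflix_titles.csv")

    # TRIM/CLEAN equivalent: strip stray whitespace from every text column.
    text_cols = df.select_dtypes(include="object").columns
    df[text_cols] = df[text_cols].apply(lambda s: s.str.strip())

    # Standardize date_added and drop exact duplicate rows.
    df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
    df = df.drop_duplicates()

    # Split duration (e.g. "90 min", "2 Seasons") into a number and a unit.
    df[["duration_value", "duration_unit"]] = df["duration"].str.extract(r"(\d+)\s*(.+)")

    # Replace remaining nulls in text columns with "Unknown".
    obj_cols = df.select_dtypes(include="object").columns
    df[obj_cols] = df[obj_cols].fillna("Unknown")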

  3. Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop

    • search.dataone.org
    • borealisdata.ca
    Updated Jul 31, 2024
    Cite
    Costanzo, Lucia; Jadon, Vivek (2024). Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop [Dataset]. http://doi.org/10.5683/SP3/FF6AI9
    Explore at:
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Borealis
    Authors
    Costanzo, Lucia; Jadon, Vivek
    Description

    Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.

  4. Retail Store Sales: Dirty for Data Cleaning

    • kaggle.com
    zip
    Updated Jan 18, 2025
    Cite
    Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning
    Explore at:
    zip (226740 bytes)
    Dataset updated
    Jan 18, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Retail Store Sales Dataset

    Overview

    The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

    File Information

    • File Name: retail_store_sales.csv
    • Number of Rows: 12,575
    • Number of Columns: 11

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01
    Category | The category of the purchased item. | Food, Furniture
    Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None
    Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None
    Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None
    Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online
    Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15
    Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None
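
    As a starting point for the cleaning exercise, a minimal pandas sketch that loads the file and audits the kinds of dirtiness described above (missing values, and Total Spent disagreeing with Quantity * Price Per Unit); it is illustrative only.

    import pandas as pd

    # Load the dataset using the file name given above.
    df = pd.read_csv("retail_store_sales.csv")

    # Audit missing values per column.
    print(df.isna().sum())

    # Flag rows where Total Spent disagrees with Quantity * Price Per Unit.
    expected = df["Quantity"] * df["Price Per Unit"]
    mismatch = (df["Total Spent"] - expected).abs() > 0.01
    print(f"{mismatch.sum()} rows with inconsistent totals")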

    Categories and Items

    The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

    Electric Household Essentials

    Item Code | Item Name | Price
    Item_1_EHE | Blender | 5.0
    Item_2_EHE | Microwave | 6.5
    Item_3_EHE | Toaster | 8.0
    Item_4_EHE | Vacuum Cleaner | 9.5
    Item_5_EHE | Air Purifier | 11.0
    Item_6_EHE | Electric Kettle | 12.5
    Item_7_EHE | Rice Cooker | 14.0
    Item_8_EHE | Iron | 15.5
    Item_9_EHE | Ceiling Fan | 17.0
    Item_10_EHE | Table Fan | 18.5
    Item_11_EHE | Hair Dryer | 20.0
    Item_12_EHE | Heater | 21.5
    Item_13_EHE | Humidifier | 23.0
    Item_14_EHE | Dehumidifier | 24.5
    Item_15_EHE | Coffee Maker | 26.0
    Item_16_EHE | Portable AC | 27.5
    Item_17_EHE | Electric Stove | 29.0
    Item_18_EHE | Pressure Cooker | 30.5
    Item_19_EHE | Induction Cooktop | 32.0
    Item_20_EHE | Water Dispenser | 33.5
    Item_21_EHE | Hand Blender | 35.0
    Item_22_EHE | Mixer Grinder | 36.5
    Item_23_EHE | Sandwich Maker | 38.0
    Item_24_EHE | Air Fryer | 39.5
    Item_25_EHE | Juicer | 41.0

    Furniture

    Item Code | Item Name | Price
    Item_1_FUR | Office Chair | 5.0
    Item_2_FUR | Sofa | 6.5
    Item_3_FUR | Coffee Table | 8.0
    Item_4_FUR | Dining Table | 9.5
    Item_5_FUR | Bookshelf | 11.0
    Item_6_FUR | Bed F...
  5. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original Netflix dataset. The data consists of titles added to Netflix from 2008 to 2021, with release years ranging from 1925 to 2021. The dataset is cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard is linked from the original dataset page.

    Data Cleaning

    We are going to:
    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns
    Extra steps and more explanation of the process are given in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)                                                                                      
    FROM netflix 
    GROUP BY show_id                                                                                              
    ORDER BY show_id DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    -- We can see that there are NULLs:
    --   director_nulls   = 2634
    --   movie_cast_nulls =  825
    --   country_nulls    =  831
    --   date_added_nulls =   10
    --   rating_nulls     =    4
    --   duration_nulls   =    3
    

    The director column's nulls are about 30% of the column, so I will not delete them; I will populate them from another column instead. To populate the director column, we first check whether there is a relationship between the movie_cast and director columns.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    -- With this, we can now populate NULL director rows
    -- using their paired movie_cast records.
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL ;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --When I was doing this, I found a less complex and faster way to populate a column which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to director, we are going to populate missing countries from rows that share the same director.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULLs in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    date_added has only 10 nulls out of over 8,000 rows, so deleting them will not materially affect the analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix_clean
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  6. Data from: Survey data from the Australian Marine Debris Initiative

    • research.usc.edu.au
    • researchdata.edu.au
    csv
    + more versions
    Cite
    Heidi Tait; Jodi Jones; Caitlin Smith; Kathy Townsend, Survey data from the Australian Marine Debris Initiative [Dataset]. https://research.usc.edu.au/esploro/outputs/dataset/Survey-data-from-the-Australian-Marine/991016398702621
    Explore at:
    csv (7054018 bytes)
    Dataset provided by
    University of the Sunshine Coast
    Authors
    Heidi Tait; Jodi Jones; Caitlin Smith; Kathy Townsend
    Time period covered
    2024
    Description

    Survey data from the Australian Marine Debris Initiative and the result of spatial analysis from multiple Creative Commons datasets. Data consists of:
    • Spatial Data Queensland Coastline – event summaries within an Excel data table and shapefile
    • All years
    • Number of items removed, weight, volunteers, volume, distance, latitude and longitude
    • Contributing organisation files table/sites
    • Environmental, physical and biological variables associated with the closest catchment to each debris survey
    TBF has made all reasonable efforts to ensure that the information in the Custom Dataset is accurate. TBF will not be held responsible:
    • for the way these data are used by the Entity for their Reports;
    • for any errors that may be contained in the Custom Dataset; or
    • for any direct or indirect damage the use of the Custom Dataset may cause.
    Data collected by TBF comes from citizen science initiatives and is taken at face value from contributors, with each entry being vetted and periodic checks being made to maintain the integrity of the overall dataset. Some clean-up data has been extrapolated by data collectors. Some weight and distance details have not been provided by contributors. The data was collected by various organisations and individuals in clean-up events at their chosen locations, where man-made items greater than 5mm were removed from the beach, sorted, counted and recorded on data sheets, using CyberTracker software devices or the AMDI mobile application. Items were identified according to the method laid out in the TBF Marine Debris Identification Manual, in which items are grouped according to their material categories (the manual is available on the TBF website). The length of beach cleaned is at the discretion of the clean-up group, and the total weight of items removed is either weighed with handheld scales or estimated.

  7. Data from: Designing data science workshops for data-intensive environmental...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Dec 8, 2020
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Explore at:
    zip
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Dryad
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    Time period covered
    Nov 14, 2020
    Description

    Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    The surveys administered for the fall 2018 to spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose names include survey contain raw data from the pre-workshop surveys, and the data files whose names include assessment contain raw data from the post-workshop assessment survey.
    
    
    The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, r...
    
  8. Sentiment_Analysis

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    Dinesh Kumar (2024). Sentiment_Analysis [Dataset]. https://huggingface.co/datasets/DineshKumar1329/Sentiment_Analysis
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2024
    Authors
    Dinesh Kumar
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Data Collection

    The sentiment data used in this project was collected manually. The dataset, stored in an Excel file (Sentiment.xlsx), includes text samples with corresponding sentiment labels. Details about the manual data collection process, sources, and criteria are explained in the data collection section of the code.

    Data Cleaning

    Before training the sentiment analysis model, the collected data undergoes a cleaning process. The clean_text function is applied to each… See the full description on the dataset page: https://huggingface.co/datasets/DineshKumar1329/Sentiment_Analysis.
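
    The description truncates before showing clean_text. Purely as an illustration (this is an assumption, not the author's code), a text-normalization helper of that kind might look like:

    import re
    import string

    def clean_text(text: str) -> str:
        # Hypothetical normalizer; the dataset's actual clean_text is not shown.
        text = text.lower()                        # case-fold
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
        text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
        return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

    print(clean_text("Loved it!! See https://example.com"))  # -> "loved it see"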

  9. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:
    • demographics (281 variables),
    • dietary consumption (324 variables),
    • physiological functions (1,040 variables),
    • occupation (61 variables),
    • questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
    • medications (29 variables),
    • mortality information linked from the National Death Index (15 variables),
    • survey weights (857 variables),
    • environmental exposure biomarker measurements (598 variables), and
    • chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
    • "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    • "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    • "dictionary_drug_codes.csv" contains the dictionary for descriptors on the drug codes.
    • "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.
    • "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
    • "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter code: The set of starter code to help users conduct exposome analysis consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.
    • "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
    • "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    • "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
    • "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
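
    For users working in Python rather than R, a minimal sketch of merging two cleaned modules is below. The file names and the SEQN join key are assumptions (NHANES conventionally uses SEQN as the participant identifier; confirm against the data dictionary).

    import pandas as pd

    # Hypothetical file names for two cleaned modules from this record.
    demographics = pd.read_csv("demographics_clean.csv")
    chemicals = pd.read_csv("chemicals_clean.csv")

    # Join on the participant identifier (assumed to be SEQN).
    merged = demographics.merge(chemicals, on="SEQN", how="inner")
    print(merged.shape)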

  10. Global Airline Routes

    • gis-for-secondary-schools-schools-be.hub.arcgis.com
    • fesec-cesj.opendata.arcgis.com
    • +1more
    Updated May 30, 2018
    Cite
    ArcGIS StoryMaps (2018). Global Airline Routes [Dataset]. https://gis-for-secondary-schools-schools-be.hub.arcgis.com/datasets/Story::global-airline-routes
    Explore at:
    Dataset updated
    May 30, 2018
    Dataset authored and provided by
    ArcGIS StoryMaps
    Area covered
    Description

    This layer visualizes over 60,000 commercial flight paths. The data was obtained from openflights.org, and was last updated in June 2014. The site states, "The third-party that OpenFlights uses for route data ceased providing updates in June 2014. The current data is of historical value only. As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67,663 routes between 3,321 airports on 548 airlines spanning the globe. Creating and maintaining this database has required and continues to require an immense amount of work. We need your support to keep this database up-to-date." To donate, visit the site and click the PayPal link.

    Routes were created using the XY-to-line tool in ArcGIS Pro, inspired by Kenneth Field's work, and following a modified methodology from Michael Markieta (www.spatialanalysis.ca/2011/global-connectivity-mapping-out-flight-routes). Some cleanup was required in the original data, including adding missing location data for several airports and some missing IATA codes. Before performing the point-to-line conversion, the key to preserving attributes in the original data is a combination of the INDEX and MATCH functions in Microsoft Excel. Example function: =INDEX(Airlines!$B$2:$B$6200,MATCH(Routes!$A2,Airlines!$D$2:Airlines!$D$6200,0))
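
    For anyone reproducing that attribute join outside Excel, INDEX/MATCH is equivalent to a key-based merge. A minimal pandas sketch, with OpenFlights-style column names assumed rather than taken from the layer:

    import pandas as pd

    routes = pd.read_csv("routes.csv")      # assumed to carry an airline code column "airline"
    airlines = pd.read_csv("airlines.csv")  # assumed to map "code" -> "name"

    # INDEX/MATCH equivalent: look up each route's airline name by its code.
    routes = routes.merge(
        airlines[["code", "name"]].rename(columns={"name": "airline_name"}),
        left_on="airline", right_on="code", how="left",
    )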

  11. An Archive of #DH2016 Tweets Published from Sunday 10 to Tuesday 12 July...

    • city.figshare.com
    bin
    Updated May 31, 2023
    + more versions
    Cite
    Ernesto Priego (2023). An Archive of #DH2016 Tweets Published from Sunday 10 to Tuesday 12 July 2016 GMT [Dataset]. http://doi.org/10.6084/m9.figshare.3484817.v2
    Explore at:
    bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    City, University of London
    Authors
    Ernesto Priego
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background
    The Digital Humanities 2016 conference took place in Kraków, Poland, between Sunday 11 July and Saturday 16 July 2016. #DH2016 was the conference's official hashtag.

    What This Output Is
    This is an Excel spreadsheet file containing three sheets with a total of 3,478 Tweets publicly published with the hashtag #DH2016. The archive starts with a Tweet published on Sunday July 10 2016 00:03:41 +0000 and finishes with a Tweet published on Tuesday July 12 2016 23:55:47 +0000. The original collection has been organised into conference days, one sheet per day (GMT and Central European Times included). A breakdown of Tweets per day:
    • Sunday 10 July 2016: 179 Tweets
    • Monday 11 July 2016: 981 Tweets
    • Tuesday 12 July 2016: 2,318 Tweets

    Methodology and Limitations
    The Tweets contained in this file were collected by Ernesto Priego using Martin Hawksey's TAGS 6.0. Only users with at least 1 follower were included in the archive. Retweets have been included (Retweets count as Tweets). The collection spreadsheet was customised to reflect the time zone and geographical location of the conference. The profile_image_url and entities_str metadata were removed before public sharing in this archive. Please bear in mind that the conference hashtag has been spammed, so some Tweets collected may be from spam accounts. Some automated refining has been performed to remove Tweets not related to the conference, but the data is likely to require further refining and deduplication. Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (Gonzalez-Bailon, Sandra, et al. 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet tagged with #dh2016 during the indicated period, and the dataset is shared for archival, comparative and indicative educational research purposes only. Only content from public accounts is included and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps, without the need of a Twitter account. Each Tweet and its contents were published openly on the Web with the queried hashtag and are the responsibility of the original authors. No private personal information is shared in this dataset. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road. This dataset is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

    Other Considerations
    Tweets published publicly by scholars during academic conferences are often tagged (labeled) with a hashtag dedicated to the conference in question. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). A hashtag is metadata users choose freely to use so their content is associated with, directly linked to and categorised under the chosen hashtag.
    Though every reason for Tweeters' use of hashtags can be neither generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter's Privacy and data sharing policies. Professional associations like the Modern Language Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. To date, collecting in real time is the only relatively accurate method to archive tweets at a small scale. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. The CC-BY license has been applied to the output in the repository as a curated dataset. Authorial/curatorial/collection work has been performed on the file in order to make it available as part of the scholarly record. The data contained in the deposited file is otherwise freely available elsewhere through different methods, and anyone not wishing to attribute the data to the creator of this output is, needless to say, free to do their own collection and clean their own data.

  12. Data from: Data cleaning and enrichment through data integration: networking...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Feb 25, 2025
    Cite
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. http://doi.org/10.5061/dryad.wpzgmsbwj
    Explore at:
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:
    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset, and authors with no match (e.g., because of not being part of an Italian university) have been discarded. The remaining authors compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

    Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Manuscript published in Scientific Data with DOI .

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format;

    along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):

    • University-City-match.xlsx, an Excel file that maps the name of a university against the city where its respective headquarter is located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
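
    A minimal sketch of loading the two main files in Python (pandas for the edge list, networkx for the GraphML; both readers are standard):

    import networkx as nx
    import pandas as pd

    # Edge list with temporal co-authorship information.
    edges = pd.read_csv("edge_data_AGG.csv")

    # Full attributed network in GraphML format.
    G = nx.read_graphml("Coauthorship_Network_AGG.graphml")
    print(G.number_of_nodes(), G.number_of_edges())  # expect 38,220 and 507,050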

    Description of the main data files

    The `Coauthorship_Networ...

  13. City of Tempe 2023 Business Survey Data

    • catalog.data.gov
    • s.cnmilf.com
    • +12more
    Updated Sep 20, 2024
    + more versions
    Cite
    City of Tempe (2024). City of Tempe 2023 Business Survey Data [Dataset]. https://catalog.data.gov/dataset/city-of-tempe-2023-business-survey-data
    Explore at:
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    City of Tempe
    Area covered
    Tempe
    Description

    These data include the individual responses for the City of Tempe Annual Business Survey conducted by ETC Institute. These data help determine priorities for the community as part of the City's on-going strategic planning process. Averaged Business Survey results are used as indicators for city performance measures. The performance measures with indicators from the Business Survey include the following (as of 2023): 1. Financial Stability and Vitality: 5.01 Quality of Business Services.
    The location data in this dataset is generalized to the block level to protect privacy: only the first two digits of an address are used to map the location, and when the data are shared with the city only the latitude/longitude of the block-level address points are provided. This results in points that overlap, so overlapping points were randomly dispersed for better visualization. These two adjustments ensure that points are not tied to a specific address but are still close enough to allow insights about service delivery in different areas of the city.

    Additional Information
    Source: Business Survey
    Contact (author): Adam Samuels
    Contact E-Mail (author): Adam_Samuels@tempe.gov
    Contact (maintainer):
    Contact E-Mail (maintainer):
    Data Source Type: Excel table
    Preparation Method: Data received from vendor after report is completed
    Publish Frequency: Annual
    Publish Method: Manual
    Data Dictionary

    Methods: The survey is mailed to a random sample of businesses in the City of Tempe. Follow-up emails and texts are also sent to encourage participation, with a link to the survey provided in each communication. To prevent people who do not live in Tempe, or who were not selected as part of the random sample, from completing the survey, everyone who completed the survey was required to provide their address. These addresses were then matched to those used for the random representative sample; if the respondent's address did not match, the response was not used. To better understand how services are being delivered across the city, individual results were mapped to determine overall distribution across the city.

    Processing and Limitations: As described above, locations are generalized to the block level and overlapping points are randomly dispersed. The data are used by the ETC Institute in the final published PDF report.

  14. RAAAP-2 Datasets (17 linked datasets)

    • figshare.com
    bin
    Updated May 30, 2023
    Cite
    Simon Kerridge; Patrice Ajai-Ajagbe; Cindy Kiel; Jennifer Shambrook; BRYONY WAKEFIELD (2023). RAAAP-2 Datasets (17 linked datasets) [Dataset]. http://doi.org/10.6084/m9.figshare.18972935.v2
    Explore at:
    bin
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Simon Kerridge; Patrice Ajai-Ajagbe; Cindy Kiel; Jennifer Shambrook; BRYONY WAKEFIELD
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains the 17 anonymised datasets from the RAAAP-2 international survey of research management and administration professionals undertaken in 2019. To preserve anonymity the data are presented in 17 datasets linked only by AnalysisRegionofEmployment, as many of the textual responses, even though redacted to remove institutional affiliation, could be used to identify some individuals if linked to the other data. Each dataset is presented in the original SPSS format, suitable for further analyses, as well as an Excel equivalent for ease of viewing. There are additional files in this collection showing the questionnaire and the mappings to the datasets, together with the SPSS scripts used to produce the datasets. These data follow on from, but are not directly linked to, the first RAAAP survey undertaken in 2016, data from which can also be found in Figshare.
    Errata (16/5/23): an error in v13 of the main Data Cleansing syntax file (now updated to v14) meant that two variables were missing their value labels (the underlying codes were correct); a new version (SPSS & Excel) of the Main Dataset has been updated.

  15. Dataset Supporting: Dynamic Modeling of Poultry Litter Composting in High...

    • data.mendeley.com
    Updated Nov 22, 2024
    + more versions
    Cite
    Alvaro Patiño-Forero (2024). Dataset Supporting: Dynamic Modeling of Poultry Litter Composting in High Mountain Climates using System Identification Techniques [Dataset]. http://doi.org/10.17632/dgxxj2pk8s.2
    Explore at:
    Dataset updated
    Nov 22, 2024
    Authors
    Alvaro Patiño-Forero
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Description: This dataset contains detailed measurements collected during two controlled experiments designed to study the dynamics of the composting process using the forced aeration technique. The dataset is divided into two main parts:

    • Experiment 1: includes parameters such as temperatures (hot air and compost pile), relative humidity, and air and heat inputs.
    • Experiment 2: complements the first experiment with oxygen levels in addition to the previously mentioned variables.
    Both datasets are organized in a chronological format, with records that allow the analysis of trends and correlations among the studied variables.

    Purpose: The primary objective of this dataset is to facilitate the study of composting dynamics in high mountain environments using the forced aeration technique. It can be used for:

    • Bioprocess modeling.
    • Studies on energy optimization in biological and chemical processes.
    • Research in environmental biology, process engineering, and clean technologies.

    Dataset Features:
    • Total size: Experiment 1 has 4,302 records and 8 variables; Experiment 2 has 3,076 records and 9 variables.
    • Temporal coverage: records are organized by hour and minute over several days of experimentation.
    • Key variables: hour and minute of the record; heater and compost pile temperatures; relative humidity; air and heat inputs; oxygen levels (in Experiment 2); days elapsed since the start of the experiment.
    • Available formats: the dataset is available in Excel format (.xlsx), with each experiment documented on a separate sheet.
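
    Since the two experiments live on separate sheets of one .xlsx file, a minimal pandas sketch for loading them is below (the file name is an assumption):

    import pandas as pd

    # Load every sheet at once; keys are sheet names, values are DataFrames.
    sheets = pd.read_excel("poultry_litter_composting.xlsx", sheet_name=None)
    for name, frame in sheets.items():
        print(name, frame.shape)  # expect roughly 4302 x 8 and 3076 x 9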

    Access and Use: Commercial use of the dataset requires prior authorization.

    Potential Applications: This resource is valuable for researchers in fields such as:
    • Environmental engineering and bioprocesses.
    • Design and optimization of thermal and environmental control systems.

  16. University of Cape Town Student Admissions Data 2006-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jul 28, 2020
    + more versions
    Cite
    UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/556
    Explore at:
    Dataset updated
    Jul 28, 2020
    Dataset authored and provided by
    UCT Student Administration
    Time period covered
    2006 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst the data were ill-suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data, and the data are only made available to accredited researchers through DataFirst's Secure Data Service.

    The dataset was separated into the following data files:

    1. Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.
    2. Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.
    3. Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific and the data allows the user to distinguish between, for example, higher grade accounting and standard grade accounting. It also allows the user to identify the educational authority issuing the qualification e.g. Cambridge Internal Examinations (CIE) versus National Senior Certificate (NSC).
    4. Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.

    Analysis unit

    Applications, individuals

    Kind of data

    Administrative records [adm]

    Mode of data collection

    Other [oth]

    Cleaning operations

    The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services. The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.

  17. Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022

    • datasets.ai
    • catalog.data.gov
    Updated Jul 18, 2025
    + more versions
    Cite
    U.S. Environmental Protection Agency (2025). Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022 [Dataset]. https://datasets.ai/datasets/data-set-st-louis-river-watershed-mn-conductivity-assessment-march-2022
    Explore at:
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Authors
    U.S. Environmental Protection Agency
    Area covered
    Saint Louis River, Minnesota
    Description

    Data used to evaluate potential downstream impacts of the NorthMet Mine, which USEPA's Office of Research and Development is providing for USEPA Region 5's use, including a characterization of stream specific conductivity (SC) levels, least-disturbed background SC, and SC levels that may exceed the Fond du Lac Band's water quality standards and adversely affect aquatic life, including brook trout (Salvelinus fontinalis), lake sturgeon (Acipenser fulvescens), and benthic macroinvertebrates. Keywords: conductivity, St. Louis River, benthic invertebrates, mining.
    The attached Excel Pedigree includes:
    • _Datasets: data files uploaded to EPA Science Hub and/or Environmental Data Set Gateway
    • _R: clean R scripts used to generate document figures and tables
    • _Tables_Figures: files generated from the R scripts and used in the Region 5 memo
    • 20220325 R Code and Data: all additional files used for this project, including original files, intermediate files, extra output files, and extra functions
    The "_R" folder contains four subfolders. Each subfolder has several R scripts, input and output files, and an R project file. Users can run the R scripts directly from each subfolder by installing R, RStudio, and the associated R packages.
    Data Dictionary: see the DataDictionary tab in the Excel file.
    Datasets: Simplified language is used in the text to identify parent data sets. Source and file names are retained in this pedigree in original form so that the R scripts retain functionality.
    • Thingvold et al. (1975-1977)
    • Griffith (1998-2009)
    • Predicted background (2000-2015)
    • Water Quality Portal (1996-2021)
    • Water Quality Portal Less Disturbed (1996-2021)
    • Minnesota Pollution Control Agency (MPCA) (1996-2013)
    • Mid-Atlantic Highlands (1990 to 2014)

    This dataset is associated with the following publication: Cormier, S., and Y. Wang. Appendix C: ORD Specific Conductance Memo, from Susan Cormier to Tera Fong. March 15, 2022. Assessment of effects of increased ion concentrations in the St. Louis River Watershed with special attention to potential mining influence and the jurisdiction of the Fond du Lac Band of Lake Superior Chippewa. U.S. Environmental Protection Agency, Washington, DC, USA, 2022.

  18. Supporting dataset of the article: Underneath Social Media Texts: Sentiment...

    • scidb.cn
    Updated Mar 4, 2024
    Cite
    Bingyao Jia; Meifang Xie; Jing Wu; Junyi Zhao (2024). Supporting dataset of the article: Underneath Social Media Texts: Sentiment Responses to Public Health Emergency During 2022 COVID-19 Pandemic in China [Dataset]. http://doi.org/10.57760/sciencedb.16527
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Bingyao Jia; Meifang Xie; Jing Wu; Junyi Zhao
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    This dataset is the supporting data for the paper Underneath Social Media Texts: Sentiment Responses to Public Health Emergency During 2022 COVID-19 Pandemic in China. It is mainly used to analyze Weibo text and perform sentiment analysis. The data were obtained from Weibo, and the texts were crawled using a Python Weibo crawler tool. The data contain time, text content, user address, etc. Cleaned Weibo data was then obtained after a cleaning operation in Excel. Using an improved Chinese sentiment lexicon, a sentiment analysis tool was applied to the texts to derive the main sentiment and sentiment scores; the result file is Sentiment analysis results. Finally, ADF and KPSS analysis tools were used to assess the stationarity of sentiment scores in different cities. The Weibo text and sentiment analysis results in the dataset are in .xlsx format, and the rest of the tools are Python code. Crawled data is limited by time, specific search terms, and other restrictions; different collection times and terms may lead to differences in the data.
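
    For context on the stationarity checks mentioned above, a minimal Python sketch using statsmodels is below; the file and column names are placeholders, not taken from the dataset.

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller, kpss

    # Placeholder names: a sentiment-score series for one city.
    scores = pd.read_excel("sentiment_analysis_results.xlsx")["sentiment_score"].dropna()

    adf_stat, adf_p, *_ = adfuller(scores)
    kpss_stat, kpss_p, *_ = kpss(scores, regression="c")

    # ADF: a low p-value suggests stationarity; KPSS: a low p-value suggests non-stationarity.
    print(f"ADF p={adf_p:.3f}, KPSS p={kpss_p:.3f}")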

  19. Coffee Shop Sales Analysis

    • kaggle.com
    Updated Apr 25, 2024
    Cite
    Monis Amir (2024). Coffee Shop Sales Analysis [Dataset]. https://www.kaggle.com/datasets/monisamir/coffee-shop-sales-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    Kaggle
    Authors
    Monis Amir
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Analyzing Coffee Shop Sales: Excel Insights 📈

    In my first data analytics project, I explore the secrets of a fictional coffee shop's success through data-driven analysis. By analyzing a 5-sheet Excel dataset, I've uncovered valuable sales trends, customer preferences, and insights that can guide future business decisions. 📊☕

    DATA CLEANING 🧹

    • REMOVED DUPLICATES OR IRRELEVANT ENTRIES: Thoroughly eliminated duplicate records and irrelevant data to refine the dataset for analysis.

    • FIXED STRUCTURAL ERRORS: Rectified any inconsistencies or structural issues within the data to ensure uniformity and accuracy.

    • CHECKED FOR DATA CONSISTENCY: Verified the integrity and coherence of the dataset by identifying and resolving any inconsistencies or discrepancies.

    DATA MANIPULATION 🛠️

    • UTILIZED LOOKUPS: Used Excel's lookup functions for efficient data retrieval and analysis.

    • IMPLEMENTED INDEX MATCH: Leveraged the Index Match function to perform advanced data searches and matches.

    • APPLIED SUMIFS FUNCTIONS: Utilized SumIFs to calculate totals based on specified criteria.

    • CALCULATED PROFITS: Used relevant formulas and techniques to determine profit margins and insights from the data.

    PIVOTING THE DATA 𝄜

    • CREATED PIVOT TABLES: Utilized Excel's PivotTable feature to pivot the data for in-depth analysis.

    • FILTERED DATA: Utilized pivot tables to filter and analyze specific subsets of data, enabling focused insights; especially used in the “PEAK HOURS” and “TOP 3 PRODUCTS” charts (a pandas sketch of the same logic follows below).
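
    The same pivot logic can be expressed outside Excel. A minimal pandas sketch, with file and column names assumed rather than taken from the workbook:

    import pandas as pd

    # Column names below are assumptions, not the workbook's actual headers.
    sales = pd.read_csv("coffee_shop_sales.csv", parse_dates=["transaction_time"])

    # PEAK HOURS: total sales by hour of day (clustered-column-chart equivalent).
    peak_hours = sales.groupby(sales["transaction_time"].dt.hour)["amount"].sum()

    # TOP 3 PRODUCTS: SUMIFS-style totals per product, then the three largest.
    top3 = sales.pivot_table(index="product", values="amount", aggfunc="sum").nlargest(3, "amount")
    print(peak_hours, top3, sep="\n")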

    VISUALIZATION 📊

    • KEY INSIGHTS: Unveiled the grand total sales revenue while also analyzing the average bill per person, offering comprehensive insights into the coffee shop's performance and customer spending habits.

    • SALES TREND ANALYSIS: Used Line chart to compute total sales across various time intervals, revealing valuable insights into evolving sales trends.

    • PEAK HOUR ANALYSIS: Leveraged Clustered Column chart to identify peak sales hours, shedding light on optimal operating times and potential staffing needs.

    • TOP 3 PRODUCTS IDENTIFICATION: Utilized Clustered Bar chart to determine the top three coffee types, facilitating strategic decisions regarding inventory management and marketing focus.

    *I also used a Timeline to visualize chronological data trends and identify key patterns over specific times.

    While it's a significant milestone for me, I recognize that there's always room for growth and improvement. Your feedback and insights are invaluable to me as I continue to refine my skills and tackle future projects. I'm eager to hear your thoughts and suggestions on how I can make my next endeavor even more impactful and insightful.

    THANKS TO: WsCube Tech, Mo Chen, Alex Freberg

    TOOLS USED: Microsoft Excel

    #DataAnalytics #DataAnalyst #ExcelProject #DataVisualization #BusinessIntelligence #SalesAnalysis #DataAnalysis #DataDrivenDecisions

  20. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Explore at:
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 3 release notes:
    • Adds data in the following formats: Excel.
    • Changes project name to avoid confusing this data for the ones done by NACJD.
    Version 2 release notes:
    • Adds data for 2017.
    • Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g., for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g., money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data-entry error values (e.g., they are larger than the agency's population, or much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

    For every numeric column I replaced negative indicator values (e.g., "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
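
    As an illustration of the sentinel-value cleaning described above (the author worked in R; this is a pandas equivalent, and the file name and column prefixes are assumptions):

    import pandas as pd

    # Common data-entry error values listed above.
    SENTINELS = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000,
                 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942]

    df = pd.read_csv("property_stolen_recovered.csv")  # assumed file name

    # Replace sentinels with NA only in the "offenses" and "auto" column groups,
    # leaving the "value" columns untouched.
    cols = [c for c in df.columns if c.startswith(("offenses", "auto"))]
    df[cols] = df[cols].replace(SENTINELS, pd.NA)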
