100+ datasets found
  1. (🌇Sunset) 🇺🇦 Ukraine Conflict Twitter Dataset

    • kaggle.com
    zip
    Updated Apr 2, 2024
    Cite
    BwandoWando (2024). (🌇Sunset) 🇺🇦 Ukraine Conflict Twitter Dataset [Dataset]. https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows
    Explore at:
    zip (18,174,367,560 bytes)
    Dataset updated
    Apr 2, 2024
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Ukraine
    Description

    IMPORTANT (02-Apr-2024)

    Kaggle has fixed the issue with gzip files, and Version 510 should now contain properly working files.

    IMPORTANT (28-Mar-2024)

    Please use version 508 of the dataset, as version 509 is broken. A properly working version is available at https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows/versions/508

    Context

    The context and history of the ongoing conflict can be found at https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukraine.

    Announcement

    [Jun 16] (🌇Sunset) Twitter has finally pulled the plug on all of my remaining Twitter API accounts as part of its push to migrate developers to the new API. The last tweets I pulled were dated Jun 14; there is no data from Jun 15 onwards. It was fun while it lasted, and I hope this dataset has helped and will continue to help a lot. I'll leave the dataset here for future download and reference. Thank you all!

    [Apr 19] Two additional developer accounts have been permanently suspended; expect lower throughput in the next few weeks. I will pull data until they ban my last account.

    [Apr 08] I woke up this morning and saw that Twitter has banned/permanently suspended 4 of my developer accounts. I have a few more, but it is just a matter of time until all my accounts get banned as well. This was a fun project that I have maintained for as long as I can. I will pull data until my last account gets banned.

    [Feb 26] I've started to pull in RETWEETS again, so I am expecting a significant increase in tweet throughput on top of the dedicated processes that fetch NON-RETWEETS. If you don't want RETWEETS, just filter them out (see the sketch below).
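    For example, a minimal pandas sketch for dropping retweets; the file name and the text column here are assumptions about the CSV layout, so verify them against the actual files:

```python
import pandas as pd

# Hypothetical file/column names -- check against the actual gzipped CSVs.
df = pd.read_csv("20230101_UkraineCombinedTweetsDeduped.csv.gz", compression="gzip")

# Retweets conventionally begin with "RT @"; keep everything else.
no_retweets = df[~df["text"].str.startswith("RT @", na=False)]
```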

    [Feb 24] It's been a year since I started collecting tweets about this conflict, and I had no idea that a year later it would still be ongoing. Almost everyone assumed that Ukraine would crumble in a matter of days, but that has not been the case. To those who have been using my dataset, I hope it is helping all of you in one way or another. I'll do my best to keep updating this dataset for as long as I can.

    [Feb 02] I seem to be getting fewer tweets because my crawlers are being throttled; I used to get 2,500 tweets per 15 minutes, but around 2-3 of my crawlers are now hitting throttling limit errors. Twitter may have made some kind of update to its rate limits. I will try to find ways to increase the throughput again.

    [Jan 02] All new datasets will now be prefixed by the year; for Jan 01, 2023, the file will be named 20230101_XXXX.

    [Dec 28] For those looking for a cleaned version of my dataset, with the retweets from before Aug 08 removed, here is a dataset by @vbmokin: https://www.kaggle.com/datasets/vbmokin/russian-invasion-ukraine-without-retweets

    [Nov 19] I noticed that one of my developer accounts, which ISN'T TWEETING ANYTHING and just pulls data out of Twitter, has been permanently banned by Twitter.com, hence the decrease in unique tweets. I will try to come up with a solution to increase my throughput and sign up for a new developer account.

    [Oct 19] I just noticed that this dataset finally reached "GOLD" status, roughly seven months after I first uploaded my gzipped csv files.

    [Oct 11] Sudden spike in the number of tweets, revolving around the most recent developments: the Kerch Bridge explosion and the response from Russia.

    [Aug 19 - IMPORTANT] I raised the missing dataset issue with the Kaggle team, and they confirmed it was a bug introduced by a ReactJS upgrade; the conversation and details can be seen here: https://www.kaggle.com/discussions/product-feedback/345915. It has already been fixed, and I've re-uploaded all the gzipped files that were lost PLUS the new files that were generated AFTER the issue was identified.

    [Aug 17] The latest version of my dataset seems to have lost around 100+ files. The good thing is that this dataset is versioned, so one can just go back to the previous version(s) and download them. Version 188 HAS ALL THE LOST FILES. I won't be re-uploading every file, as that would be tedious; I've already deleted them locally, and I only store the latest 2-3 days.

    [Aug 10] 3/5 of my Python processes errored out, resulting in around 10-12 hours of NO data gathering for those processes, hence the sharp decrease in tweets for the Aug 09 dataset. I've added exception/error checking to prevent this from happening again.

    [Aug 09] Significant drop in tweets extracted, but I am now getting ORIGINAL/NON-RETWEETS.

    [Aug 08] I noticed a spike in tweets extracted, but they were literally thousands of retweets of a single original tweet. I also noticed that my crawlers seem to deviate because of this tactic being used by some Twitter users, where they flood Twitter w...

  2. COVID-19 Case Surveillance Public Use Data

    • data.cdc.gov
    • opendatalab.com
    • +6more
    application/rdfxml +5
    Updated Jul 9, 2024
    + more versions
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data [Dataset]. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf
    Explore at:
    application/rdfxml, tsv, csv, json, xml, application/rssxml
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    https://www.usa.gov/government-works

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors, but no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    The following apply to all three datasets:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.

    COVID-19 Case Reports

    COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.

    All cases reported on or after that date were requested to be shared by public health departments with CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.
    • Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question “Was the individual hospitalized?” where the possible answer choices include “Yes,” “No,” or “Unknown,” the blank value is recoded to Missing because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race and ethnicity, and healthcare worker status have been prioritized.
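    To make the first cleaning step above concrete, a minimal pandas sketch of the blank-to-Missing recode; the file name and the hosp_yn column are assumptions about the public use file, so verify them against the actual export:

```python
import pandas as pd

# Hypothetical file/column names -- verify against the actual export.
df = pd.read_csv("COVID-19_Case_Surveillance_Public_Use_Data.csv")

# Blank (NaN) answers to an applicable question are recoded to "Missing".
df["hosp_yn"] = df["hosp_yn"].fillna("Missing")
print(df["hosp_yn"].value_counts())  # e.g., Yes / No / Unknown / Missing
```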

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.

    For questions, please contact Ask SRRG (eocevent394@cdc.gov).

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These

  3. Number of global social network users 2017-2028

    • statista.com
    • es.statista.com
    Cite
    Stacy Jo Dixon, Number of global social network users 2017-2028 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    How many people use social media?

    Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to rise to over six billion in 2028.

    Who uses social media?
    Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions in infrastructure development and the availability of cheap mobile devices. In fact, most of social media's global growth is driven by the increasing usage of mobile devices. The mobile-first market of Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.

    How much time do people spend on social media?
    Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America had the highest average time spent per day on social media.

    What are the most popular social media platforms?
    Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included the mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
    
  4. FF | 2024 08 13 | Analog Dataset

    • kaggle.com
    Updated Nov 12, 2024
    Cite
    Ramy Harik (2024). FF | 2024 08 13 | Analog Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/9878271
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramy Harik
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is one of two industry-grade datasets captured during an 8-hour continuous operation of the manufacturing assembly line at the Future Factories Lab, University of South Carolina, on 08/13/2024. The datasets adhere to industry standards, covering communication protocols, actuators, control mechanisms, transducers, sensors, and cameras. Data collection utilized both integrated and external sensors throughout the laboratory, including sensors embedded within the actuators and externally installed devices. Additionally, high-performance cameras captured key aspects of the operation. In a prior experiment [1], a 30-hour continuous run was conducted, during which all anomalies were documented. Maintenance procedures were subsequently implemented to reduce potential errors and operational disruptions. The two datasets include: (1) a time-series analog dataset, and (2) a multi-modal time-series dataset containing synchronized system data and images. These datasets aim to support future research in advancing manufacturing processes by providing a platform for testing novel algorithms without the need to recreate physical manufacturing environments. Moreover, the datasets are open-source and designed to facilitate the training of artificial intelligence models, streamlining research by offering comprehensive, ready-to-use resources for various applications and projects.

  5. Labour Force Survey Two-Quarter Longitudinal Dataset, October 2024 - March...

    • beta.ukdataservice.ac.uk
    Updated 2025
    + more versions
    Cite
    Office For National Statistics (2025). Labour Force Survey Two-Quarter Longitudinal Dataset, October 2024 - March 2025 [Dataset]. http://doi.org/10.5255/ukda-sn-9389-1
    Explore at:
    Dataset updated
    2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Office For National Statistics
    Description

    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Longitudinal data
    The LFS retains each sample household for five consecutive quarters, with a fifth of the sample replaced each quarter. The main survey was designed to produce cross-sectional data, but the data on each individual have now been linked together to provide longitudinal information. The longitudinal data comprise two types of linked datasets, created using the weighting method to adjust for non-response bias. The two-quarter datasets link data from two consecutive waves, while the five-quarter datasets link across a whole year (for example January 2010 to March 2011 inclusive) and contain data from all five waves. A full series of longitudinal data has been produced, going back to winter 1992. Linking together records to create a longitudinal dimension can, for example, provide information on gross flows over time between different labour force categories (employed, unemployed and economically inactive). This will provide detail about people who have moved between the categories. Also, longitudinal information is useful in monitoring the effects of government policies and can be used to follow the subsequent activities and circumstances of people affected by specific policy initiatives, and to compare them with other groups in the population. There are however methodological problems which could distort the data resulting from this longitudinal linking. The ONS continues to research these issues and advises that the presentation of results should be carefully considered, and warnings should be included with outputs where necessary.

    New reweighting policy
    Following the new reweighting policy, ONS reviewed the latest population estimates made available during 2019 and decided not to carry out a 2019 LFS and APS reweighting exercise. Therefore, the next reweighting exercise will take place in 2020. This will incorporate the 2019 Sub-National Population Projection data (published in May 2020) and 2019 Mid-Year Estimates (published in June 2020). It is expected that reweighted Labour Market aggregates and microdata will be published towards the end of 2020/early 2021.

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each user guide volume alongside the appropriate questionnaire for the year concerned. However, volumes are updated periodically by ONS, so users are advised to check the latest documents on the ONS Labour Force Survey - User Guidance pages before commencing analysis. This is especially important for users of older QLFS studies, where information and guidance in the user guide documents may have changed over time.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly data; Secure Access datasets; household datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    Variables DISEA and LNGLST
    Dataset A08 (Labour market status of disabled people), which ONS suspended due to an apparent discontinuity between April to June 2017 and July to September 2017, is now available. As a result of this apparent discontinuity and the inconclusive investigations at this stage, comparisons between April to June 2017 and subsequent time periods should be made with caution. However, users should note that the estimates are not seasonally adjusted, so some of the change between quarters could be due to seasonality. Further recommendations on historical comparisons of the estimates will be given in November 2018, when ONS is due to publish estimates for July to September 2018.

    An article explaining the quality assurance investigations that have been conducted so far is available on the ONS Methodology webpage. For any queries about Dataset A08 please email Labour.Market@ons.gov.uk.

    Occupation data for 2021 and 2022 data files

    The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

    2022 Weighting

    The population totals used for the latest LFS estimates use projected growth rates from Real Time Information (RTI) data for UK, EU and non-EU populations based on 2021 patterns. The total population used for the LFS therefore does not take into account any changes in migration, birth rates, death rates, and so on since June 2021, and hence levels estimates may be under- or over-estimating the true values and should be used with caution. Estimates of rates will, however, be robust.

    Production of two-quarter longitudinal data resumed, April 2024

    In April 2024, ONS resumed production of the two-quarter longitudinal data, along with quarterly household data. As detailed in the ONS Labour Market Transformation update of April 2024, for longitudinal data, flows between October to December 2023 and January to March 2024 will similarly mark the start of a new time series. This will be consistent with LFS weighting from equivalent person quarterly datasets, but will not be consistent with historic longitudinal data before this period.

  6. caucasian-people-kyc-photo-dataset

    • huggingface.co
    Updated Apr 9, 2024
    Cite
    Training Data (2024). caucasian-people-kyc-photo-dataset [Dataset]. https://huggingface.co/datasets/TrainingDataPro/caucasian-people-kyc-photo-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2024
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Know Your Customer Dataset, Face Detection and Re-identification

      A similar dataset that includes all ethnicities: Selfies and ID Dataset
    

    80,000+ photos, including 10,600+ document photos from 5,300 people across 28 countries. The dataset includes 2 photos of each person from their documents and 13 selfies. All people presented in the dataset are Caucasian. The dataset contains a variety of images capturing individuals from diverse backgrounds and age groups. Photo documents… See the full description on the dataset page: https://huggingface.co/datasets/TrainingDataPro/caucasian-people-kyc-photo-dataset.

  7. XAU/USD Gold Price Historical Data (2004-2025)

    • kaggle.com
    Updated Jul 9, 2025
    Cite
    Novandra Anugrah (2025). XAU/USD Gold Price Historical Data (2004-2025) [Dataset]. https://www.kaggle.com/datasets/novandraanugrah/xauusd-gold-price-historical-data-2004-2024
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 9, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Novandra Anugrah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains historical price data for XAU/USD (gold vs. USD) from 2004 to Feb 2025, captured across multiple timeframes: 5-minute, 15-minute, 30-minute, 1-hour, 4-hour, daily, weekly, and monthly intervals. It includes Open, High, Low, and Close prices plus Volume data.
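    As a quick-start illustration, a minimal pandas sketch that loads the daily file and derives weekly bars; the file name and column headers are assumptions, so adjust them to the downloaded CSVs:

```python
import pandas as pd

# Hypothetical file/column names -- adjust to the actual CSV headers.
daily = pd.read_csv("XAU_USD_daily.csv", parse_dates=["Date"], index_col="Date").sort_index()

# Resample daily OHLCV bars into weekly bars.
weekly = daily.resample("W").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
```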

  8. A dataset of Data Subject Access Request Packages

    • zenodo.org
    Updated Jul 5, 2024
    Cite
    Nicola Leschke; Daniela Pöhn; Frank Pallas (2024). A dataset of Data Subject Access Request Packages [Dataset]. http://doi.org/10.5281/zenodo.11634938
    Explore at:
    Dataset updated
    Jul 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicola Leschke; Daniela Pöhn; Frank Pallas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is a minimal example of Data Subject Access Request Packages (SARPs), as they can be retrieved under data protection laws, specifically the GDPR. It includes data from two data subjects, each with accounts for five major services, namely Amazon, Apple, Facebook, Google, and LinkedIn.

    Purpose and Usage

    This dataset is meant to be an initial dataset that allows for manual exploration of structures and contents found in SARPs. Hence, the number of controllers and user profiles should be minimal but sufficient to allow cross-subject and cross-controller analysis. This dataset can be used to explore structures, formats and data types found in real-world SARPs. Thereby, the planning of future SARP-based research projects and studies shall be facilitated.

    We invite other researchers to use this dataset to explore the structure of SARPs. The envisioned primary usage includes the development of user-centric privacy interfaces and other technical contributions in the area of data access rights. Moreover, these packages can also be used for exemplified data analyses, although no substantive research questions can be answered using this data. In particular, this data does not reflect how data subjects behave in the real world. However, it is representative enough to give a first impression of the types of data analysis possible when using real-world data.

    Data Generation

    In order to allow cross-subject analysis, while keeping the re-identification risk minimal, we used research-only accounts for the data generation. A detailed explanation of the data generation method can be found in the paper corresponding to the dataset, accepted for the Annual Privacy Forum 2024.

    In short, two user profiles were designed, and corresponding accounts were created for each of the five services. Those accounts were then used for two to four months. During the usage period, we minimized the amount of identifying data and avoided interactions with data subjects not part of this research. Afterwards, we performed a data access request via each controller's web interface. Finally, the data was cleansed as described in detail in the accompanying paper and in brief within the following section.

    Data Cleansing

    Before publication, both possibly identifying information and security-relevant attributes need to be obfuscated or deleted. Moreover, multi-party data (especially messages with external entities) must be deleted. Where data is obfuscated, we made sure to substitute multiple occurrences of the same information with the same replacement. We provide a list of deleted and obfuscated items, the obfuscation scheme and, if applicable, the replacement.

    The list of obfuscated items looks like the following example:

    path | file type | filename | attribute | scheme | replacement
    linkedin\Linkedin_Basic | csv | messages.csv | TO | semantic description | Firstname Lastname
    gooogle\Meine Aktivitäten\Datenexport | html | MeineAktivitäten.html | IP Address | loopback | 127.142.201.194
    facebook\personal_information | json | profile_information.json | emails | semantic description | firstname.lastname@gmail.com

    Data Characterization

    To give an overview of the dataset, we publicly provide some metadata about the usage time and SARP characteristics of the exports from subject A / subject B.

    provider | usage time (months) | export options | file types | # subfolders | # files | export size
    Amazon | 2/4 | all categories | CSV (32/49), EML (2/5), JPEG (1/2), JSON (3/3), PDF (9/10), TXT (4/4) | 41/49 | 51/73 | 1.2 MB / 1.4 MB
    Apple | 2/4 | all data; max. 1 GB / max. 4 GB | CSV (8/3) | 20/1 | 8/3 | 71.8 KB / 294.8 KB
    Facebook | 2/4 | all data; JSON/HTML; on my computer | JSON (39/0), HTML (0/63), TXT (29/28), JPG (0/4), PNG (1/15), GIF (7/7) | 45/76 | 76/117 | 12.3 MB / 13.5 MB
    Google | 2/4 | all data; frequency once; ZIP; max. 4 GB | HTML (8/11), CSV (10/13), JSON (27/28), TXT (14/14), PDF (1/1), MBOX (1/1), VCF (1/0), ICS (1/0), README (1/1), JPG (0/2) | 44/51 | 64/71 | 1.54 MB / 1.2 MB
    LinkedIn | 2/4 | all data | CSV (18/21) | 0/0 (part 1), 0/0 (part 2) | 13/18 (part 1), 19/21 (part 2) | 3.9 KB / 6.0 KB (part 1), 6.2 KB / 9.2 KB (part 2)

    (All paired values are subject A / subject B.)


    Authors

    This data collection was performed by Daniela Pöhn (Universität der Bundeswehr München, Germany), Frank Pallas and Nicola Leschke (Paris Lodron Universität Salzburg, Austria). For questions, please contact nicola.leschke@plus.ac.at.

    Accompanying Paper

    The dataset was collected according to the method presented in:
    Leschke, Pöhn, and Pallas (2024). "How to Drill Into Silos: Creating a Free-to-Use Dataset of Data Subject Access Packages". Accepted for Annual Privacy Forum 2024.

  9. Cline Center Coup d’État Project Dataset

    • databank.illinois.edu
    Updated May 11, 2025
    + more versions
    Cite
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto (2025). Cline Center Coup d’État Project Dataset [Dataset]. http://doi.org/10.13012/B2IDB-9651987_V7
    Explore at:
    Dataset updated
    May 11, 2025
    Authors
    Buddy Peyton; Joseph Bajjalieh; Dan Shalmon; Michael Martin; Emilio Soto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coups d'État are important events in the life of a country. They constitute an important subset of irregular transfers of political power that can have significant and enduring consequences for national well-being. There are only a limited number of datasets available to study these events (Powell and Thyne 2011, Marshall and Marshall 2019). Seeking to facilitate research on post-WWII coups by compiling a more comprehensive list and categorization of these events, the Cline Center for Advanced Social Research (previously the Cline Center for Democracy) initiated the Coup d'État Project as part of its Societal Infrastructures and Development (SID) project. More specifically, this dataset identifies the outcomes of coup events (i.e., realized, unrealized, or conspiracy), the type of actor(s) who initiated the coup (i.e., military, rebels, etc.), as well as the fate of the deposed leader.

    Version 2.1.3 adds 19 additional coup events to the data set, corrects the date of a coup in Tunisia, and reclassifies an attempted coup in Brazil in December 2022 as a conspiracy. Version 2.1.2 added 6 additional coup events that occurred in 2022 and updated the coding of an attempted coup event in Kazakhstan in January 2022. Version 2.1.1 corrected a mistake in version 2.1.0, where the designation of “dissident coup” had been dropped in error for coup_id: 00201062021; version 2.1.1 fixed this omission by marking the case as both a dissident coup and an auto-coup. Version 2.1.0 added 36 cases to the data set and removed two cases from the v2.0.0 data; this update also added actor coding for 46 coup events and added executive outcomes to 18 events from version 2.0.0. A few other changes were made to correct inconsistencies in the coup ID variable and the date of the event. Version 2.0.0 improved several aspects of the previous version (v1.0.0) and incorporated additional source material to include:

    • Reconciling missing event data
    • Removing events with irreconcilable event dates
    • Removing events with insufficient sourcing (each event needs at least two sources)
    • Removing events that were inaccurately coded as coup events
    • Removing variables that fell below the threshold of inter-coder reliability required by the project
    • Removing the spreadsheet ‘CoupInventory.xls’ because of inadequate attribution and citations in the event summaries
    • Extending the period covered from 1945-2005 to 1945-2019
    • Adding events from Powell and Thyne’s Coup Data (Powell and Thyne, 2011)
    Items in this Dataset

    1. Cline Center Coup d'État Codebook v.2.1.3 Codebook.pdf - This 15-page document describes the Cline Center Coup d'État Project dataset. The first section of this codebook provides a summary of the different versions of the data. The second section provides a succinct definition of a coup d'état used by the Coup d'État Project and an overview of the categories used to differentiate the wide array of events that meet the project's definition. It also defines coup outcomes. The third section describes the methodology used to produce the data. Revised February 2024

    2. Coup Data v2.1.3.csv - This CSV (Comma Separated Values) file contains all of the coup event data from the Cline Center Coup d'État Project. It contains 29 variables and 1000 observations. Revised February 2024

    3. Source Document v2.1.3.pdf - This 325-page document provides the sources used for each of the coup events identified in this dataset. Please use the value in the coup_id variable to identify the sources used to identify that particular event. Revised February 2024

    4. README.md - This file contains useful information for the user about the dataset. It is a text file written in markdown language. Revised February 2024
    Citation Guidelines

    1. To cite the codebook (or any other documentation associated with the Cline Center Coup d'État Project Dataset), please use the following citation: Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Scott Althaus. 2024. “Cline Center Coup d'État Project Dataset Codebook”. Cline Center Coup d'État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7

    2. To cite data from the Cline Center Coup d'État Project Dataset, please use the following citation (filling in the correct date of access): Peyton, Buddy, Joseph Bajjalieh, Dan Shalmon, Michael Martin, Jonathan Bonaguro, and Emilio Soto. 2024. Cline Center Coup d'État Project Dataset. Cline Center for Advanced Social Research. V.2.1.3. February 27. University of Illinois Urbana-Champaign. doi: 10.13012/B2IDB-9651987_V7
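    A minimal pandas sketch for getting started with the files listed above; it assumes only that the CSV is in the working directory:

```python
import pandas as pd

# File name as listed under "Items in this Dataset".
coups = pd.read_csv("Coup Data v2.1.3.csv")
print(coups.shape)              # expect (1000, 29): 1000 events, 29 variables
print(coups["coup_id"].head())  # coup_id keys into Source Document v2.1.3.pdf
```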

  10. Data from: A Toolbox for Surfacing Health Equity Harms and Biases in Large...

    • springernature.figshare.com
    application/csv
    Updated Sep 24, 2024
    Cite
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal (2024). A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models [Dataset]. http://doi.org/10.6084/m9.figshare.26133973.v1
    Explore at:
    application/csv
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Stephen R. Pfohl; Heather Cole-Lewis; Rory Sayres; Darlene Neal; Mercy Asiedu; Awa Dieng; Nenad Tomasev; Qazi Mamunur Rashid; Shekoofeh Azizi; Negar Rostamzadeh; Liam G. McCoy; Leo Anthony Celi; Yun Liu; Mike Schaekermann; Alanna Walton; Alicia Parrish; Chirag Nagpal; Preeti Singh; Akeiylah Dewitt; Philip Mansfield; Sushant Prakash; Katherine Heller; Alan Karthikesalingam; Christopher Semturs; Joëlle K. Barral; Greg Corrado; Yossi Matias; Jamila Smith-Loud; Ivor B. Horn; Karan Singhal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary material and data for Pfohl and Cole-Lewis et al., "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models" (2024).

    We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.

    We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al [3].

    • Mixed MMQA-OMAQ is composed of the 140-question subset of MultiMedQA questions described in [1,2] with an additional 100 questions from OMAQ (described below). The 140 MultiMedQA questions are composed of 100 from HealthSearchQA, 20 from LiveQA [4], and 20 from MedicationQA [5]. In the data presented here, we do not reproduce the text of the questions from LiveQA and MedicationQA. For LiveQA, we instead use identifiers that correspond to those presented in the original dataset. For MedicationQA, we designate "MedicationQA_N" to refer to the N-th row of MedicationQA (0-indexed).

    A limited number of data elements described in the paper are not included here. The following elements are excluded:

    1. The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.

    2. The free-text comments written by raters during the ratings process.

    3. Demographic information associated with the consumer raters (only age group information is included).

    References

    1. Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).

    2. Singhal, K., Azizi, S., Tu, T. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2

    3. Omiye, J.A., Lester, J.C., Spichak, S. et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z

    4. Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.

    5. Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25-29.

    Description of files and sheets

    1. Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances. (A loading sketch follows the file list below.)

    2. Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.

    3. Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated, instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.

    4. Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].

    5. Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.

    6. Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.

    7. Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.

    8. Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.

    9. TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.

    10. Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.

    11. Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.

    12. HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].

    13. Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.

    14. Omiye et al. [other_datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
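    As referenced in item 1 above, a minimal pandas sketch for loading the independent ratings and tabulating the top-level bias assessment; it relies only on the bias_presence column described above:

```python
import pandas as pd

ratings = pd.read_csv("ratings_independent.csv")

# bias_presence takes one of: "No bias", "Minor bias", "Severe bias".
print(ratings["bias_presence"].value_counts())
```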

    Version history

    Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique IDs associated with each question.

    Version 1: Contained datasets of questions without ratings, consistent with the v1 preprint on arXiv (https://arxiv.org/abs/2403.12025).

    WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.

    NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

  11. Training Dataset for HNTSMRG 2024 Challenge

    • data.niaid.nih.gov
    Updated Jun 21, 2024
    Cite
    Dede, Cem (2024). Training Dataset for HNTSMRG 2024 Challenge [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199558
    Explore at:
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    Wahid, Kareem
    Dede, Cem
    Naser, Mohamed
    Fuller, Clifton
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Training Dataset for HNTSMRG 2024 Challenge

    Overview

    This repository houses the publicly available training dataset for the Head and Neck Tumor Segmentation for MR-Guided Applications (HNTSMRG) 2024 Challenge.

    Patient cohorts correspond to patients with histologically proven head and neck cancer who underwent radiotherapy (RT) at The University of Texas MD Anderson Cancer Center. The cancer types are predominantly oropharyngeal cancer or cancer of unknown primary. Images include a pre-RT T2w MRI scan (1-3 weeks before start of RT) and a mid-RT T2w MRI scan (2-4 weeks intra-RT) for each patient. Segmentation masks of primary gross tumor volumes (abbreviated GTVp) and involved metastatic lymph nodes (abbreviated GTVn) are provided for each image (derived from multi-observer STAPLE consensus).

    HNTSMRG 2024 is split into 2 tasks:

    Task 1: Segmentation of tumor volumes (GTVp and GTVn) on pre-RT MRI.

    Task 2: Segmentation of tumor volumes (GTVp and GTVn) on mid-RT MRI.

    The same patient cases will be used for the training and test sets of both tasks of this challenge. Therefore, we are releasing a single training dataset that can be used to construct solutions for either segmentation task. The test data provided (via Docker containers), however, will be different for the two tasks. Please consult the challenge website for more details.

    Data Details

    DICOM files (images and structure files) have been converted to NIfTI format (.nii.gz) for ease of use by participants via DICOMRTTool v. 1.0.

    Images are a mix of fat-suppressed and non-fat-suppressed MRI sequences. Pre-RT and mid-RT image pairs for a given patient are consistently either fat-suppressed or non-fat-suppressed.

    Though some sequences may appear to be contrast enhancing, no exogenous contrast is used.

    All images have been manually cropped from the top of the clavicles to the bottom of the nasal septum (~ oropharynx region to shoulders), allowing for more consistent image field of views and removal of identifiable facial structures.

    The mask files have one of three possible values: background = 0, GTVp = 1, GTVn = 2 (in the case of multiple lymph nodes, they are concatenated into one single label). This labeling convention is similar to the 2022 HECKTOR Challenge.

    150 unique patients are included in this dataset. Anonymized patient numeric identifiers are utilized.

    The entire training dataset is ~15 GB.

    Dataset Folder/File Structure

    The dataset is uploaded as a ZIP archive. Please unzip before use. NIfTI files conform to the following standardized nomenclature: ID_timepoint_image/mask.nii.gz. For mid-RT files, a "registered" suffix (ID_timepoint_image/mask_registered.nii.gz) indicates the image or mask has been registered to the mid-RT image space (see more details in Additional Notes below).

    The data is provided with the following folder hierarchy:

    • Top-level folder (named "HNTSMRG24_train")
      • Patient-level folder (anonymized patient ID, example: "2")
        • Pre-radiotherapy data folder ("preRT")
          • Original pre-RT T2w MRI volume (example: "2_preRT_T2.nii.gz").
          • Original pre-RT tumor segmentation mask (example: "2_preRT_mask.nii.gz").
        • Mid-radiotherapy data folder ("midRT")
          • Original mid-RT T2w MRI volume (example: "2_midRT_T2.nii.gz").
          • Original mid-RT tumor segmentation mask (example: "2_midRT_mask.nii.gz").
          • Registered pre-RT T2w MRI volume (example: "2_preRT_T2_registered.nii.gz").
          • Registered pre-RT tumor segmentation mask (example: "2_preRT_mask_registered.nii.gz").

    Note: Cases will exhibit variable presentation of ground truth mask structures. For example, a case could have only a GTVp label present, only a GTVn label present, both GTVp and GTVn labels present, or a completely empty mask (i.e., complete tumor response at mid-RT). The following case IDs have empty masks at mid-RT (indicating a complete response): 21, 25, 29, 42. These empty masks are not errors. There will similarly be some cases in the test set for Task 2 that have empty masks.
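    To check which labels a given mask contains (and to confirm that an empty mask, such as case 21's mid-RT mask, is genuinely all background), a minimal sketch using SimpleITK and NumPy:

```python
import SimpleITK as sitk
import numpy as np

mask = sitk.ReadImage("HNTSMRG24_train/21/midRT/21_midRT_mask.nii.gz")
labels = np.unique(sitk.GetArrayFromImage(mask))

# background = 0, GTVp = 1, GTVn = 2; an all-zero mask indicates complete response.
print("labels present:", labels)
```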

    Details Relevant for Algorithm Building

    The goal of Task 1 is to generate a pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz" is the relevant label). During blind testing for Task 1, only the pre-RT MRI (e.g., "2_preRT_T2.nii.gz") will be provided to the participants' algorithms.

    The goal of Task 2 is to generate a mid-RT segmentation mask (e.g., "2_midRT_mask.nii.gz" is the relevant label). During blind testing for Task 2, the mid-RT MRI (e.g., "2_midRT_T2.nii.gz"), original pre-RT MRI (e.g., "2_preRT_T2.nii.gz"), original pre-RT tumor segmentation mask (e.g., "2_preRT_mask.nii.gz"), registered pre-RT MRI (e.g., "2_preRT_T2_registered.nii.gz"), and registered pre-RT tumor segmentation mask (e.g., "2_preRT_mask_registered.nii.gz") will be provided to the participants' algorithms.

    When building models, the resolution of the generated prediction masks should be the same as the corresponding MRI for the given task. In other words, the generated masks should be in the correct pixel spacing and origin with respect to the original reference frame (i.e., pre-RT image for Task 1, mid-RT image for Task 2). More details on the submission of models will be located on the challenge website.
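    To satisfy that resolution requirement, one option is a minimal SimpleITK sketch that resamples a predicted mask onto the reference image grid; the prediction file name is hypothetical, and nearest-neighbor interpolation keeps the integer labels intact:

```python
import SimpleITK as sitk

pred = sitk.ReadImage("my_pred_mask.nii.gz")  # hypothetical model output
ref = sitk.ReadImage("HNTSMRG24_train/2/preRT/2_preRT_T2.nii.gz")  # Task 1 reference frame

# Identity transform: only the grid (origin, spacing, size) is harmonized.
out = sitk.Resample(pred, ref, sitk.Transform(), sitk.sitkNearestNeighbor, 0, pred.GetPixelID())
sitk.WriteImage(out, "my_pred_mask_in_ref_space.nii.gz")
```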

    Additional Notes

    General notes.

    NIfTI format images and segmentations may be easily visualized in any NIfTI viewing software such as 3D Slicer.

    Test data will not be made public until the completion of the challenge. The complete training and test data will be published together (along with all original multi-observer annotations and relevant clinical data) at a later date via The Cancer Imaging Archive. Expected date ~ Spring 2025.

    Task 1 related notes.

    When training their algorithms for Task 1, participants can choose to use only pre-RT data or add in mid-RT data as well. Initially, our plan was to limit participants to utilizing only pre-RT data for training their algorithms in Task 1. However, upon reflection, we recognized that in a practical setting, individuals aiming to develop auto-segmentation algorithms could theoretically train models using any accessible data at their disposal. Based on current literature, we actually don't know what the best solution would be! Would the incorporation of mid-RT data for training a pre-RT segmentation model actually be helpful, or would it merely introduce harmful noise? The answer remains unclear. Therefore, we leave this choice to the participants.

    Remember, though, during testing, you will ONLY have the pre-RT image as an input to your model (naturally, since Task 1 is a pre-RT segmentation task and you won't know what mid-RT data for a patient will look like).

    Task 2 related notes.

    In addition to the mid-RT MRI and segmentation mask, we have also provided a registered pre-RT MRI and the corresponding registered pre-RT segmentation mask for each patient. We offer this data for participants who opt not to integrate any image registration techniques into their algorithms for Task 2 but still wish to use the two images as a joint input to their model. Moreover, in a real-world adaptive RT context, such registered scans are typically readily accessible. Naturally, participants are also free to incorporate their own image registration processes into their pipelines if they wish (or ignore the pre-RT images/masks altogether).

    Registrations were generated using SimpleITK, where the mid-RT image serves as the fixed image and the pre-RT image serves as the moving image. Specifically, we utilized the following steps: 1. Apply a centered transformation, 2. Apply a rigid transformation, 3. Apply a deformable transformation with Elastix using a preset parameter map (Parameter map 23 in the Elastix Zoo). This particular deformable transformation was selected as it is open-source and was benchmarked in a previous similar application (https://doi.org/10.1002/mp.16128). For cases where excessive warping was noted during deformable registration (a small minority of cases), only the rigid transformation was applied.
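    For orientation, a minimal SimpleITK sketch of the first two steps (centered initialization plus rigid registration); this is a generic mutual-information setup, not the organizers' exact configuration, and the deformable Elastix step would additionally require ITKElastix or SimpleElastix:

```python
import SimpleITK as sitk

fixed = sitk.ReadImage("2_midRT_T2.nii.gz", sitk.sitkFloat32)   # mid-RT is fixed
moving = sitk.ReadImage("2_preRT_T2.nii.gz", sitk.sitkFloat32)  # pre-RT is moving

# Step 1: centered initialization.
initial = sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)

# Step 2: rigid registration.
reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetOptimizerAsRegularStepGradientDescent(
    learningRate=1.0, minStep=1e-4, numberOfIterations=200)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetInitialTransform(initial, inPlace=False)
rigid = reg.Execute(fixed, moving)

# Resample the pre-RT image into the mid-RT frame.
resampled = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0, moving.GetPixelID())
sitk.WriteImage(resampled, "2_preRT_T2_rigid.nii.gz")
```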

    Contact

    We have set up a general email address that you can message to notify all organizers at: hntsmrg2024@gmail.com. Additional specific organizer contacts:

    Kareem A. Wahid, PhD (kawahid@mdanderson.org)

    Cem Dede, MD (cdede@mdanderson.org)

    Mohamed A. Naser, PhD (manaser@mdanderson.org)

  12. SIGMOD 2024 Programming Contest Datasets

    • zenodo.org
    bin
    Updated Oct 27, 2024
    Cite
    Guoliang Li; Dong Deng (2024). SIGMOD 2024 Programming Contest Datasets [Dataset]. http://doi.org/10.5281/zenodo.13998879
    Explore at:
    bin (available download formats)
    Dataset updated
    Oct 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guoliang Li; Dong Deng
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Our datasets, both the released and the evaluation sets, are derived from the YFCC100M Dataset. Each dataset comprises vectors encoded from images using the CLIP model, which are then reduced to 100 dimensions using Principal Component Analysis (PCA). Additionally, categorical and timestamp attributes are selected from the metadata of the images. The categorical attribute is discretized into integers starting from 0, and the timestamp attribute is normalized into floats between 0 and 1.

    For each query, a query type is randomly selected from four possible types, denoted by the numbers 0 to 3. Then, we randomly choose two data points from dataset D, utilizing their categorical attribute (C), timestamp attribute (T), and vectors, to determine the values of the query. Specifically:

    • Randomly sample two data points from D.
    • Use the categorical value of the first data point as v for the equality predicate over the categorical attribute C.
    • Use the timestamp attribute values of the two sampled data points for the range predicate. Designate l as the smaller timestamp value and r as the larger. The range predicate is thus defined as l≤T≤r.
    • Use the vector of the first data point as the query vector.
    • If the query type does not involve v, l, or r, their values are set to -1.

    We ensure that at least 100 data points in D satisfy each query's predicates.
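
    As a sketch of this sampling procedure (illustrative only; this is not the organizers' published generation code):

    import numpy as np

    def generate_query(categorical, timestamp, vectors, rng=None):
        rng = rng or np.random.default_rng()
        query_type = int(rng.integers(0, 4))             # one of 0, 1, 2, 3
        i, j = rng.choice(len(vectors), size=2, replace=False)
        v = float(categorical[i]) if query_type in (1, 3) else -1.0
        if query_type in (2, 3):                         # l = smaller, r = larger timestamp
            l, r = sorted((float(timestamp[i]), float(timestamp[j])))
        else:
            l, r = -1.0, -1.0
        return np.concatenate(([query_type, v, l, r], vectors[i]))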

    Dataset Structure

    Dataset D is in a binary format, beginning with a 4-byte integer num_vectors (uint32_t) indicating the number of vectors. This is followed by the data for each vector, stored consecutively, with each vector occupying 102 (= 2 + vector_num_dimension) x sizeof(float32) bytes, summing to num_vectors x 102 x sizeof(float32) bytes in total. Specifically, for the 102 dimensions of each vector: the first dimension denotes the discretized categorical attribute C and the second dimension denotes the normalized timestamp attribute T. The remaining 100 dimensions are the vector.
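
    A minimal NumPy sketch for reading this layout (file path is a placeholder):

    import numpy as np

    def read_dataset(path, num_dimensions=100):
        with open(path, "rb") as f:
            num_vectors = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
            data = np.frombuffer(f.read(), dtype=np.float32)
        data = data.reshape(num_vectors, 2 + num_dimensions)
        categorical = data[:, 0].astype(int)   # discretized attribute C
        timestamp = data[:, 1]                 # normalized attribute T in [0, 1]
        vectors = data[:, 2:]                  # 100-dimensional PCA-reduced vectors
        return categorical, timestamp, vectors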

    Query Set Structure

    Query set Q is in a binary format, beginning with a 4-byte integer num_queries (uint32_t) indicating the number of queries. This is followed by the data for each query, stored consecutively, with each query occupying 104 (= 4 + vector_num_dimension) x sizeof(float32) bytes, summing to num_queries x 104 x sizeof(float32) bytes in total.

    The 104-dimensional representation for a query is organized as follows:

    • The first dimension denotes query_type (takes values from 0, 1, 2, 3).
    • The second dimension denotes the specific query value v for the categorical attribute (if not queried, takes -1).
    • The third dimension denotes the specific query value l for the timestamp attribute (if not queried, takes -1).
    • The fourth dimension denotes the specific query value r for the timestamp attribute (if not queried, takes -1).
    • The remaining 100 dimensions are the query vector.

    There are four types of queries, i.e., the query_type takes values from 0, 1, 2 and 3. The 4 types of queries correspond to:

    • If query_type=0: Vector-only query, i.e., the conventional approximate nearest neighbor (ANN) search query.
    • If query_type=1: Vector query with categorical attribute constraint, i.e., ANN search for data points satisfying C=v.
    • If query_type=2: Vector query with timestamp attribute constraint, i.e., ANN search for data points satisfying l≤T≤r.
    • If query_type=3: Vector query with both categorical and timestamp attribute constraints, i.e. ANN search for data points satisfying C=v and l≤T≤r.

    The predicate for the categorical attribute is an equality predicate (C=v), and the predicate for the timestamp attribute is a range predicate (l≤T≤r).
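
    A companion NumPy sketch that parses the query file and applies the attribute predicates before the ANN search (file path is a placeholder):

    import numpy as np

    def read_queries(path, num_dimensions=100):
        with open(path, "rb") as f:
            num_queries = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
            data = np.frombuffer(f.read(), dtype=np.float32)
        return data.reshape(num_queries, 4 + num_dimensions)

    def candidate_mask(query, categorical, timestamp):
        query_type, v, l, r = query[:4]
        mask = np.ones(len(categorical), dtype=bool)
        if query_type in (1, 3):               # equality predicate C = v
            mask &= categorical == int(v)
        if query_type in (2, 3):               # range predicate l <= T <= r
            mask &= (timestamp >= l) & (timestamp <= r)
        return mask                            # ANN search runs over the masked points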

    Originally provided at https://dbgroup.cs.tsinghua.edu.cn/sigmod2024/task.shtml?content=datasets.

  13. COVID-19 Reported Patient Impact and Hospital Capacity by Facility -- RAW

    • catalog.data.gov
    • healthdata.gov
    Updated Jul 4, 2025
    Cite
    U.S. Department of Health and Human Services (2025). COVID-19 Reported Patient Impact and Hospital Capacity by Facility -- RAW [Dataset]. https://catalog.data.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-by-facility-raw
    Explore at:
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    United States Department of Health and Human Services (http://www.hhs.gov/)
    Description

    After May 3, 2024, this dataset and webpage will no longer be updated because hospitals are no longer required to report data on COVID-19 hospital admissions, and hospital capacity and occupancy data, to HHS through CDC’s National Healthcare Safety Network. Data voluntarily reported to NHSN after May 1, 2024, will be available starting May 10, 2024, at COVID Data Tracker Hospitalizations.

    The following dataset provides facility-level data for hospital utilization aggregated on a weekly basis (Sunday to Saturday). These are derived from reports with facility-level granularity across two main sources: (1) HHS TeleTracking, and (2) reporting provided directly to HHS Protect by state/territorial health departments on behalf of their healthcare facilities.

    The hospital population includes all hospitals registered with Centers for Medicare & Medicaid Services (CMS) as of June 1, 2020, as well as non-CMS hospitals that have reported since July 15, 2020. It does not include psychiatric, rehabilitation, Indian Health Service (IHS), U.S. Department of Veterans Affairs (VA), Defense Health Agency (DHA), or religious non-medical facilities.

    For a given entry, the term “collection_week” signifies the start of the period that is aggregated. For example, a collection_week of 2020-11-15 covers the elements captured from that facility starting and including Sunday, November 15, 2020, and ending and including reports for Saturday, November 21, 2020.

    Reported elements carry a suffix of either “_coverage”, “_sum”, or “_avg”: “_coverage” denotes how many times the facility reported that element during the collection week; “_sum” denotes the sum of the reports provided for that element during the week; “_avg” is the average of those reports.

    The file is updated weekly. No statistical analysis is applied to impute non-response. For averages, calculations are based on the number of values collected for a given hospital in that collection week. Suppression is applied to sums and averages less than four (4); in these cases, the field is replaced with “-999,999”.

    A story page displaying both corrected and raw datasets can be accessed at https://healthdata.gov/stories/s/nhgk-5gpv. This data is preliminary and subject to change as more data become available. Data is available starting on July 31, 2020.

    Sometimes reports for a given facility are provided to both HHS TeleTracking and HHS Protect. When this occurs, deduplication is applied according to prioritization rules within HHS Protect to ensure there are no duplicate reports. For influenza fields listed in the file, current HHS guidance marks these fields as optional, so coverage of these elements varies.

    On May 3, 2021, the following fields were added to this data set:

    • hhs_ids
    • previous_day_admission_adult_covid_confirmed_7_day_coverage
    • previous_day_admission_pediatric_covid_confirmed_7_day_coverage
    • previous_day_admission_adult_covid_suspected_7_day_coverage
    • previous_day_admission_pediatric_covid_suspected_7_day_coverage
    • previous_week_personnel_covid_vaccinated_doses_administered_7_day_sum
    • total_personnel_covid_vaccinated_doses_none_7_day_sum
    • total_personnel_covid_vaccinated_doses_one_7_day_sum
    • total_personnel_covid_vaccinated_doses_all_7_day_sum
    • previous_week_patients_covid_vaccinated_doses_one_7_day_sum
    • previous_week_patients_covid_vaccinated_doses_all_
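
    As an example of working with the suffix and suppression conventions above, a hedged pandas sketch (the file name and field are placeholders standing in for any _avg column):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("covid19_reported_patient_impact_by_facility.csv")  # placeholder name

    # Suppressed sums/averages (values below four) are reported as -999,999;
    # treat them as missing before any aggregation.
    col = "total_adult_patients_hospitalized_confirmed_covid_7_day_avg"  # example field
    df[col] = df[col].replace(-999999, np.nan)

    # Weekly average across facilities, keyed by collection_week
    weekly = df.groupby("collection_week")[col].mean()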

  14. Table F Annual Budget 2024 SDCC - Dataset - data.smartdublin.ie

    • data.smartdublin.ie
    Updated Jan 5, 2024
    Cite
    (2024). Table F Annual Budget 2024 SDCC - Dataset - data.smartdublin.ie [Dataset]. https://data.smartdublin.ie/dataset/table-f-annual-budget-2024-sdcc1
    Explore at:
    Dataset updated
    Jan 5, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table F is the Expenditure and Income for the Budget Year and Estimated Outturn for the previous Year. It contains ‘Expenditure’ and ‘Income’ Adopted by the Council for the Budget Year; ‘Expenditure’ and ‘Income’ Estimated by the Chief Executive for the Budget Year; ‘Expenditure’ and ‘Income’ Adopted by the Council for the previous Year; and ‘Expenditure’ and ‘Income’ Estimated Outturn for the previous Year. Table F provides a breakdown of the Expenditure to Sub-Service level and Income to Income Source per Council Division contained in Table A. In the published Annual Budget document, Table F is published as a separate table for each Division.

    Section 1 of Table F contains Expenditure broken down by ‘Division’, ‘Service’ and ‘Sub-Service’. Section 2 of Table F contains Income broken down by ‘Division’, ‘Income Type’ and ‘Income Source’. The data in this dataset is best interpreted by comparison with Table F in the published Annual Budget document, which can be found at https://www.sdcc.ie/en/services/our-council/policies-and-plans/budgets-and-spending/annual-budget/

    Data fields for Table F are as follows:

    • Doc: Table Reference
    • Heading: Indicates sections in the Table. Table F is comprised of two sections: Heading = 1 for all Expenditure records; Heading = 2 for all Income records.
    • Ref: Division Reference
    • Ref_Desc: Division Description
    • Ref1: Service Reference for Expenditure records (Heading = 1) or Income Type for Income records (Heading = 2)
    • Ref1_Desc: Service Description for Expenditure records or Income Type Description for Income records
    • Ref2: Sub-Service Reference for Expenditure records or Income Source for Income records
    • Ref2_Desc: Sub-Service Description for Expenditure records or Income Source Description for Income records
    • Adop: Amount Adopted by Council for Budget Year
    • EstCE: Amount Estimated by Chief Executive for Budget Year
    • PY_Adop: Amount Adopted by Council for previous Financial Year
    • PY_Outturn: Amount of Estimated Outturn for previous Financial Year
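
    For example, assuming a CSV export of this dataset, the Heading field splits the table into its two sections (a sketch, not official code; the file name is a placeholder):

    import pandas as pd

    table_f = pd.read_csv("table_f_annual_budget_2024_sdcc.csv")  # placeholder name

    expenditure = table_f[table_f["Heading"] == 1]   # Division / Service / Sub-Service
    income = table_f[table_f["Heading"] == 2]        # Division / Income Type / Income Source

    # Adopted expenditure per Division for the budget year
    adopted_by_division = expenditure.groupby("Ref_Desc")["Adop"].sum()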

  15. PDI (Police Data Initiative) Crime Incidents

    • data.cincinnati-oh.gov
    csv, xlsx, xml
    Updated Aug 17, 2025
    Cite
    City of Cincinnati (2025). PDI (Police Data Initiative) Crime Incidents [Dataset]. https://data.cincinnati-oh.gov/Safety/PDI-Police-Data-Initiative-Crime-Incidents/k59e-2pvf
    Explore at:
    xlsx, xml, csv (available download formats)
    Dataset updated
    Aug 17, 2025
    Dataset authored and provided by
    City of Cincinnati
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Note: Due to the RMS change for CPD, this data set stops on 6/2/2024. For records beginning on 6/3/2024, please see the dataset at this link: https://data.cincinnati-oh.gov/safety/Reported-Crime-STARS-Category-Offenses-/7aqy-xrv9/about_data

    Data Description: This data represents reported Crime Incidents in the City of Cincinnati. Incidents are the records of reported crimes, collated by an agency for management. Incidents are typically housed in a Records Management System (RMS) that stores agency-wide data about law enforcement operations. This does not include police calls for service, arrest information, final case determination, or any other incident outcome data.

    Data Creation: The Cincinnati Police Department (CPD) records crime incidents in the City through a Records Management System (RMS) that stores agency-wide data about law enforcement operations.

    Data Created By: The source of this data is the Cincinnati Police Department.

    Refresh Frequency: This data is updated daily.

    CincyInsights: The City of Cincinnati maintains an interactive dashboard portal, CincyInsights, in addition to our Open Data in an effort to increase access and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/8eaa-xrvz

    Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.

    Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit, the Office of Performance and Data Analytics applies standard processing to most raw data prior to publication. Processing includes, but is not limited to: address verification, geocoding, decoding attributes, and addition of administrative areas (i.e., Census, neighborhoods, police districts, etc.).

    Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad

    Disclaimer: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.

  16. Industrial screw driving dataset collection: Time series data for process...

    • zenodo.org
    tar
    Updated Jan 30, 2025
    Cite
    Nikolai West; Jochen Deuse (2025). Industrial screw driving dataset collection: Time series data for process monitoring and anomaly detection [Dataset]. http://doi.org/10.5281/zenodo.14769379
    Explore at:
    tar (available download formats)
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    Nikolai West
    Authors
    Nikolai West; Jochen Deuse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Industrial Screw Driving Datasets

    Overview

    This repository contains a collection of real-world industrial screw driving datasets, designed to support research in manufacturing process monitoring, anomaly detection, and quality control. Each dataset represents different aspects and challenges of automated screw driving operations, with a focus on natural process variations and degradation patterns.

    Scenario name | Work pieces used | Repetitions (screw cycles) per workpiece | Screws per workpiece | Total observations | Unique classes | Purpose
    S01_thread-degradation | 100 | 25 | 2 | 5,000 | 1 | Investigation of thread degradation through repeated fastening
    S02_surface-friction | 250 | 25 | 2 | 12,500 | 8 | Surface friction effects on screw driving operations
    S03_error-collection-1 | (unpublished) | 1 | 2 | (unpublished) | >20 | Various manipulations of the screw driving process
    S04_error-collection-2 | 2,500 | 1 | 2 | 5,000 | 25 | Various manipulations of the screw driving process

    Dataset Collection

    The datasets were collected from operational industrial environments, specifically from automated screw driving stations used in manufacturing. Each scenario investigates specific mechanical phenomena that can occur during industrial screw driving operations:

    Currently Available Datasets:

    1. S01_thread-degradation

    • Focus: Investigation of thread degradation through repeated fastening
    • Samples: 5,000 screw operations (4,089 normal, 911 faulty)
    • Features: Natural degradation patterns, no artificial error induction
    • Equipment: Delta PT 40x12 screws, thermoplastic components
    • Process: 25 cycles per location, two locations per workpiece
    • First published in: HICSS 2024 (West & Deuse, 2024)

    2. S02_surface-friction

    • Focus: Surface friction effects on screw driving operations
    • Samples: 12,500 screw operations (9,512 normal, 2,988 faulty)
    • Features: Eight distinct surface conditions (baseline to mechanical damage)
    • Equipment: Delta PT 40x12 screws, thermoplastic components, surface treatment materials
    • Process: 25 cycles per location, two locations per workpiece
    • First published in: CIE51 2024 (West & Deuse, 2024) [DOI will be added after publication]

    Upcoming Datasets:

    3. S03_screw-error-collection-1 (recorded but unpublished)

    • Focus: Various manipulations of the screw driving process
    • Features: More than 20 different errors recorded
    • First published in: Publication planned
    • Status: In preparation

    4. S04_screw-error-collection-2 (recorded but unpublished)

    • Focus: Various manipulations of the screw driving process
    • Features: 25 distinct errors recorded over the course of a week
    • First published in: Publication planned
    • Status: In preparation

    5. S05_upper-workpiece-manipulations (recorded but unpublished)

    • Manipulations of the injection molding process with no changes during tightening

    6. S06_lower-workpiece-manipulations (recorded but unpublished)

    • Manipulations of the injection molding process with no changes during tightening

    Additional scenarios may be added to this collection as they become available.

    Data Format

    Each dataset follows a standardized structure:

    • JSON files containing individual screw operation data
    • CSV files with operation metadata and labels
    • Comprehensive documentation in README files
    • Example code for data loading and processing is available in the companion library PyScrew

    Research Applications

    These datasets are suitable for various research purposes:

    • Machine learning model development and validation
    • Process monitoring and control systems
    • Quality assurance methodology development
    • Manufacturing analytics research
    • Anomaly detection algorithm benchmarking

    Usage Notes

    • All datasets include both normal operations and natural process anomalies
    • Complete time series data for torque, angle, and additional parameters available
    • Detailed documentation of experimental conditions and setup
    • Data collection procedures and equipment specifications available

    Access and Citation

    These datasets are provided under an open-access license to support research and development in manufacturing analytics. When using any of these datasets, please cite the corresponding publication as detailed in each dataset's README file.

    Related Tools

    We recommend using our library PyScrew to load and prepare the data. However, the datasets can also be processed using standard JSON and CSV processing libraries. Common data analysis and machine learning frameworks may be used for the analysis. The .tar file provides all information required for each scenario.
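
    A minimal standard-library sketch along those lines (the archive, file, and column names here are assumptions, not the published schema; PyScrew documents the exact layout):

    import csv
    import json
    import tarfile

    # Unpack one scenario archive (placeholder name)
    with tarfile.open("S01_thread-degradation.tar") as tar:
        tar.extractall("S01")

    # Operation metadata and labels (assumed CSV layout)
    with open("S01/labels.csv", newline="") as f:
        labels = list(csv.DictReader(f))

    # One screw operation: time series stored as JSON (assumed field name)
    with open("S01/" + labels[0]["file_name"]) as f:
        run = json.load(f)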

    Documentation

    Each dataset includes:

    • Detailed README files
    • Data format specifications
    • Equipment and process parameters
    • Experimental setup documentation
    • Citation information

    Contact and Support


    For questions, issues, or collaboration interests regarding these datasets, please:

    1. Open an issue in the respective GitHub repository
    2. Contact the authors through the provided institutional channels

    Acknowledgments

    These datasets were collected and prepared from:

    • RIF Institute for Research and Transfer e.V.
    • Technical University Dortmund, Institute for Production Systems

    The research was supported by:

    • German Ministry of Education and Research (BMBF)
    • European Union's "NextGenerationEU" program (the research is part of this funding program)

  17. Human Vital Sign Dataset

    • kaggle.com
    Updated Jul 19, 2024
    Cite
    DatasetEngineer (2024). Human Vital Sign Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8992827
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DatasetEngineer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    The Human Vital Signs Dataset is a comprehensive collection of key physiological parameters recorded from patients. This dataset is designed to support research in medical diagnostics, patient monitoring, and predictive analytics. It includes both original attributes and derived features to provide a holistic view of patient health.

    Attributes

    • Patient ID: A unique identifier assigned to each patient. Type: Integer. Example: 1, 2, 3, ...
    • Heart Rate: The number of heartbeats per minute. Type: Integer. Range: 60-100 bpm (for this dataset). Example: 72, 85, 90
    • Respiratory Rate: The number of breaths taken per minute. Type: Integer. Range: 12-20 breaths per minute (for this dataset). Example: 16, 18, 15
    • Timestamp: The exact time at which the vital signs were recorded. Type: Datetime. Format: YYYY-MM-DD HH:MM. Example: 2023-07-19 10:15:30
    • Body Temperature: The body temperature measured in degrees Celsius. Type: Float. Range: 36.0-37.5°C (for this dataset). Example: 36.7, 37.0, 36.5
    • Oxygen Saturation: The percentage of oxygen-bound hemoglobin in the blood. Type: Float. Range: 95-100% (for this dataset). Example: 98.5, 97.2, 99.1
    • Systolic Blood Pressure: The pressure in the arteries when the heart beats (systolic pressure). Type: Integer. Range: 110-140 mmHg (for this dataset). Example: 120, 130, 115
    • Diastolic Blood Pressure: The pressure in the arteries when the heart rests between beats (diastolic pressure). Type: Integer. Range: 70-90 mmHg (for this dataset). Example: 80, 75, 85
    • Age: The age of the patient. Type: Integer. Range: 18-90 years (for this dataset). Example: 25, 45, 60
    • Gender: The gender of the patient. Type: Categorical. Categories: Male, Female
    • Weight (kg): The weight of the patient in kilograms. Type: Float. Range: 50-100 kg (for this dataset). Example: 70.5, 80.3, 65.2
    • Height (m): The height of the patient in meters. Type: Float. Range: 1.5-2.0 m (for this dataset). Example: 1.75, 1.68, 1.82

    Derived Features

    • Derived_HRV (Heart Rate Variability): A measure of the variation in time between heartbeats. Type: Float. Formula: HRV = (standard deviation of heart rate over a period) / (mean heart rate over the same period). Example: 0.10, 0.12, 0.08
    • Derived_Pulse_Pressure (Pulse Pressure): The difference between systolic and diastolic blood pressure. Type: Integer. Formula: PP = Systolic Blood Pressure - Diastolic Blood Pressure. Example: 40, 45, 30
    • Derived_BMI (Body Mass Index): A measure of body fat based on weight and height. Type: Float. Formula: BMI = Weight (kg) / (Height (m))^2. Example: 22.8, 25.4, 20.3
    • Derived_MAP (Mean Arterial Pressure): The average blood pressure in an individual during a single cardiac cycle. Type: Float. Formula: MAP = Diastolic Blood Pressure + (1/3) x (Systolic Blood Pressure - Diastolic Blood Pressure). Example: 93.3, 100.0, 88.7

    Target Feature

    • Risk Category: Classification of patients into "High Risk" or "Low Risk" based on their vital signs. Type: Categorical. Categories: High Risk, Low Risk. A patient is High Risk if any of the following conditions hold, and Low Risk otherwise: Heart Rate > 90 bpm or < 60 bpm; Respiratory Rate > 20 or < 12 breaths per minute; Body Temperature > 37.5°C or < 36.0°C; Oxygen Saturation < 95%; Systolic Blood Pressure > 140 mmHg or < 110 mmHg; Diastolic Blood Pressure > 90 mmHg or < 70 mmHg; BMI > 30 or < 18.5.

    This dataset, with a total of 200,000 samples, provides a robust foundation for various machine learning and statistical analysis tasks aimed at understanding and predicting patient health outcomes based on vital signs. The inclusion of both original attributes and derived features enhances the richness and utility of the dataset.
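
    To make the derived definitions concrete, a short pandas sketch (column names mirror the attribute list above and may differ from the actual CSV headers):

    import pandas as pd

    df = pd.read_csv("human_vital_signs.csv")  # placeholder file name

    df["Derived_Pulse_Pressure"] = df["SystolicBP"] - df["DiastolicBP"]
    df["Derived_BMI"] = df["Weight"] / df["Height"] ** 2
    df["Derived_MAP"] = df["DiastolicBP"] + (df["SystolicBP"] - df["DiastolicBP"]) / 3

    # Risk category from the stated thresholds (any condition marks High Risk)
    high_risk = (
        (df["HeartRate"] > 90) | (df["HeartRate"] < 60)
        | (df["RespiratoryRate"] > 20) | (df["RespiratoryRate"] < 12)
        | (df["BodyTemperature"] > 37.5) | (df["BodyTemperature"] < 36.0)
        | (df["OxygenSaturation"] < 95)
        | (df["SystolicBP"] > 140) | (df["SystolicBP"] < 110)
        | (df["DiastolicBP"] > 90) | (df["DiastolicBP"] < 70)
        | (df["Derived_BMI"] > 30) | (df["Derived_BMI"] < 18.5)
    )
    df["Risk Category"] = high_risk.map({True: "High Risk", False: "Low Risk"})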

  18. Fine particulate matter (PM2.5) datasets for both Indoor and Outdoor...

    • data.mendeley.com
    Updated Jan 24, 2025
    Cite
    Kabir Bahadur Shah (2025). Fine particulate matter (PM2.5) datasets for both Indoor and Outdoor environments at two South Texans residences. [Dataset]. http://doi.org/10.17632/yk2pz4hcb5.2
    Explore at:
    Dataset updated
    Jan 24, 2025
    Authors
    Kabir Bahadur Shah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes high-frequency PM2.5 (µg/m3) measurements collected at 2-minute intervals, along with calculated 1-hour and daily average concentrations, at two residences in the Rio Grande Valley (RGV), Texas, using PurpleAir-II-SD monitors. It also includes temperature (ºC) and relative humidity (%) measured by the sensors. The data were collected for one month during the summer from indoor (kitchen) and outdoor (front yard) environments at two locations: Home-1 in Weslaco from June 16, 2024, to July 16, 2024, and Home-2 in Mission from June 22, 2024, to July 16, 2024. Furthermore, resultant wind speed and wind direction data were obtained from nearby Texas Commission on Environmental Quality (TCEQ) Continuous Ambient Monitoring Stations (CAMS), providing a broader understanding of meteorological influences on air quality. The raw PM2.5 data were corrected using Barkjohn's US-wide correction equation to increase the accuracy of the measurements, and the corrected values are presented in the datasets. The dataset is available in three CSV files: one for Home-1, one for Home-2, and one for wind data (Wind Rose) from the TCEQ stations.
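
    For reference, the Barkjohn et al. US-wide correction is commonly written as PM2.5 = 0.524 x PA(cf_1) - 0.0862 x RH + 5.75; a small sketch follows (verify the exact coefficients against the dataset documentation before reuse):

    def barkjohn_correction(pa_cf1, rh):
        """Corrected PM2.5 (ug/m3) from raw PurpleAir cf_1 PM2.5 and relative humidity (%)."""
        return 0.524 * pa_cf1 - 0.0862 * rh + 5.75

    # Example: a raw reading of 20 ug/m3 at 60% relative humidity
    print(barkjohn_correction(20.0, 60.0))   # ~11.1 ug/m3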

  19. Table E Annual Budget 2024 SDCC - Dataset - data.smartdublin.ie

    • data.smartdublin.ie
    Updated Jan 5, 2024
    Cite
    (2024). Table E Annual Budget 2024 SDCC - Dataset - data.smartdublin.ie [Dataset]. https://data.smartdublin.ie/dataset/table-e-annual-budget-2024-sdcc1
    Explore at:
    Dataset updated
    Jan 5, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table E is the Analysis of Budget Income from Grants and Subsidies and LPT Self-Funding.

    Section 1 of Table E contains Income from the Department of Environment, Community and Local Government by Division, including ‘Income’ by ‘Source of Income’ from Grants, Subsidies and LPT Self-Funding for the Budget Year and for the Previous Financial Year. Section 2 of Table E contains Income from Other Departments and Bodies, including ‘Income’ by ‘Source of Income’ from Grants and Subsidies for the Budget Year and for the Previous Financial Year. The data in this dataset is best interpreted by comparison with Table E in the published Annual Budget document, which can be found at https://www.sdcc.ie/en/services/our-council/policies-and-plans/budgets-and-spending/annual-budget/

    Data fields for Table E are as follows:

    • Doc: Table Reference
    • Heading: Indicates sections in the Table. Table E is comprised of two sections: Heading = 1 for all records in the first section; Heading = 2 for all records in the second section.
    • Ref: Source of Income Reference
    • Desc: Source of Income Description
    • Inc: Income Adopted by Council for Budget Year
    • PY: Income for Previous Financial Year
    • Sort: Sorting Code

  20. Statewide Death Profiles

    • data.chhs.ca.gov
    • data.ca.gov
    csv
    Updated Jul 28, 2025
    Cite
    California Department of Public Health (2025). Statewide Death Profiles [Dataset]. https://data.chhs.ca.gov/dataset/statewide-death-profiles
    Explore at:
    csv(2026589), csv(385695), csv(463460), csv(5401561), csv(164006) (available download formats)
    Dataset updated
    Jul 28, 2025
    Dataset authored and provided by
    California Department of Public Health (https://www.cdph.ca.gov/)
    Description

    This dataset contains counts of deaths for California as a whole based on information entered on death certificates. Final counts are derived from static data and include out-of-state deaths to California residents, whereas provisional counts are derived from incomplete and dynamic data. Provisional counts are based on the records available when the data was retrieved and may not represent all deaths that occurred during the time period. Deaths involving injuries from external or environmental forces, such as accidents, homicide and suicide, often require additional investigation that tends to delay certification of the cause and manner of death. This can result in significant under-reporting of these deaths in provisional data.

    The final data tables include both deaths that occurred in California regardless of the place of residence (by occurrence) and deaths to California residents (by residence), whereas the provisional data table only includes deaths that occurred in California regardless of the place of residence (by occurrence). The data are reported as totals, as well as stratified by age, gender, race-ethnicity, and death place type. Deaths due to all causes (ALL) and selected underlying cause of death categories are provided. See temporal coverage for more information on which combinations are available for which years.

    The cause of death categories are based solely on the underlying cause of death as coded by the International Classification of Diseases. The underlying cause of death is defined by the World Health Organization (WHO) as "the disease or injury which initiated the train of events leading directly to death, or the circumstances of the accident or violence which produced the fatal injury." It is a single value assigned to each death based on the details as entered on the death certificate. When more than one cause is listed, the order in which they are listed can affect which cause is coded as the underlying cause. This means that similar events could be coded with different underlying causes of death depending on variations in how they were entered. Consequently, while underlying cause of death provides a convenient comparison between cause of death categories, it may not capture the full impact of each cause of death as it does not always take into account all conditions contributing to the death.
