A 2021 poll conducted among privacy experts worldwide showed that ** percent of companies collected personal data of subjects living in the EU, while ** percent of firms did the same for individuals living in Canada. A further ** percent of survey respondents stated that their companies collected personal data from identifiable subjects in the United Kingdom.
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high; the growth was higher than previously expected, driven by increased demand during the COVID-19 pandemic as more people worked and learned from home and relied more heavily on home entertainment options. Storage capacity is also growing. Only a small percentage of this newly created data is kept, though: just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
By Amber Thomas [source]
This dataset provides an estimation of broadband usage in the United States, focusing on how many people have access to broadband and how many are actually using it at broadband speeds. Through data collected by Microsoft from our services, including package size and total time of download, we can estimate the throughput speed of devices connecting to the internet across zip codes and counties.
According to Federal Communications Commission (FCC) estimates, 14.5 million people don't have access to any kind of broadband connection. This dataset aims to address the gap between estimated availability and actual use by providing more accurate usage numbers downscaled to county and zip code levels. Who gets counted as having access matters greatly: it determines who is included in public funding opportunities dedicated to closing the digital divide. The implications can be huge: millions of people across the country could remain invisible if these numbers aren't accurately reported or used properly in decision-making.
This dataset includes aggregated information for locations with fewer than 20 devices, for increased accuracy when estimating broadband usage in the United States. This allows others to use it to develop solutions that improve internet access, or to accurately identify problem areas where no real or reliable connectivity exists, in communities large and small across the US mainland. Please review the license terms before using these data so that you adhere to the stipulations set forth in Microsoft's Open Use of Data Agreement v1.0, whether your use is professional or educational.
How to Use the US Broadband Usage Dataset
This dataset provides broadband usage estimates in the United States by county and zip code. It is ideally suited for research into how broadband connects households, towns and cities. Understanding this information is vital for closing existing disparities in access to high-speed internet, and for devising strategies for making sure all Americans can stay connected in a digital world.
The dataset contains six columns:
- County – The name of the county for which usage statistics are provided.
- Zip Code (5-Digit) – The 5-digit zip code from which usage data was collected within that county or metropolitan/micropolitan area or division, as reported by the US Census Bureau in 2018[2].
- Population (Households) – Estimated number of households, defined according to [3], based on data from the US Census Bureau American Community Survey's 5-Year Estimates[4].
- Average Throughput (Mbps) – Average download speed in Mbps, derived from data collected from anonymous devices connected through Microsoft services such as Windows Update, Office 365, and Xbox Live Core Services[5].
- Percent Fast (> 25 Mbps) – Percentage of machines with throughput greater than 25 Mbps, calculated using [6].
- Percent Slow (< 3 Mbps) – Percentage of machines with throughput less than 3 Mbps, calculated using [7].
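As a starting point, here is a minimal pandas sketch for loading the file and flagging counties with low measured speeds. The file name comes from the "Data Source" section below, and the column names are assumptions taken from the descriptions above, so inspect the real CSV header first and adjust accordingly.

```python
import pandas as pd

# Column names used here are assumptions based on the column list above;
# inspect the real header first and adjust accordingly.
df = pd.read_csv("broadband_data_2020October.csv")
print(df.columns.tolist())

# Hypothetical example: counties where fewer than half of the measured
# machines reach broadband speed (> 25 Mbps), assuming the percentage is
# stored as a fraction between 0 and 1.
fast_col = "Percent Fast (> 25 Mbps)"
if fast_col in df.columns:
    underserved = df[df[fast_col] < 0.5]
    print(underserved[["County", fast_col]].head())
```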
Possible applications include:
- Targeting marketing campaigns based on broadband use. Companies can use the geographic and demographic data in this dataset to create advertising campaigns tailored to people living in areas where broadband access is scarce or lacking.
- Creating an educational platform for those without reliable broadband access. By leveraging existing technologies such as satellite internet, media streaming services like Netflix, and platforms such as Khan Academy or edX, those with limited access could gain new educational options from home.
- Establishing public-private partnerships between local governments and telecom providers, which need better data about gaps in service coverage and usage levels to make decisions about infrastructure buildouts that improve connectivity options for rural communities.
If you use this dataset in your research, please credit the original authors.
Data Source
See the dataset description for more information.
File: broadband_data_2020October.csv
By Andy Kriebel [source]
This dataset contains the average price of 1GB of mobile data by country. It includes data for over 150 countries, providing a valuable resource for anyone interested in global mobile data pricing trends. The data is sourced from Visual Capitalist and was last updated in 2021.
This dataset contains the average price of 1GB of mobile data by country as of April 2021. The data is sourced from Visual Capitalist.
To use this dataset, you can simply download it and then open it in your preferred spreadsheet software. The dataset is organized by rank, with the most expensive countries being listed first. Each row also lists the country's name and its corresponding average price for 1GB of mobile data.
If you want to compare the cost of mobile data across different countries, this dataset provides a useful starting point. You can also use it to track how prices have changed over time by comparing against previous editions of the dataset.
1. To compare the cost of living in different countries.
2. To find out which countries have the most expensive mobile data plans.
3. To see how the cost of mobile data has changed over time.
License
License: Dataset copyright by authors.
You are free to:
- Share – copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt – remix, transform, and build upon the material for any purpose, even commercially.
You must:
- Give appropriate credit – provide a link to the license, and indicate if changes were made.
- ShareAlike – distribute your contributions under the same license as the original.
- Keep intact – all notices that refer to this license, including copyright notices.
File: Cost of 1GB of Data.csv

| Column name | Description |
|:------------------------|:----------------------------------------------------------------------------|
| Rank | The rank of the country based on the cost of 1GB of mobile data. (Numeric) |
| Country | The country where the data was collected. (String) |
| Avg Price of 1GB (USD) | The average price of 1GB of mobile data in the country, in USD. (Numeric) |
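For example, a short pandas sketch to list the cheapest and most expensive countries; the column name used for sorting is taken from the table above, so verify it against the actual file header.

```python
import pandas as pd

# Column names follow the table above; verify against the actual CSV header.
prices = pd.read_csv("Cost of 1GB of Data.csv")

# Sort by average price to see the cheapest and most expensive countries.
by_price = prices.sort_values("Avg Price of 1GB (USD)")
print(by_price.head(5))  # cheapest 1 GB of mobile data
print(by_price.tail(5))  # most expensive 1 GB of mobile data
```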
If you use this dataset in your research, please credit Andy Kriebel.
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.
Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.
We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
By 2025, forecasts suggest that there will be more than ** billion Internet of Things (IoT) connected devices in use. This would be a nearly threefold increase from the IoT installed base in 2019.

What is the Internet of Things? The IoT refers to a network of devices that are connected to the internet and can "communicate" with each other. Such devices include everyday tech gadgets such as smartphones and wearables, smart home devices such as smart meters, as well as industrial devices like smart machines. These smart connected devices are able to gather, share, and analyze information and act on it accordingly. By 2023, global spending on IoT will reach *** trillion U.S. dollars.

How does the Internet of Things work? IoT devices use sensors and processors to collect and analyze data acquired from their environments. The collected sensor data is shared by being sent to a gateway or to other IoT devices, and is then either sent to and analyzed in the cloud or analyzed locally. By 2025, the data volume created by IoT connections is projected to reach a massive total of **** zettabytes.

Privacy and security concerns. Given the amount of data generated by IoT devices, it is no wonder that data privacy and security are among the major concerns around IoT adoption. Once devices are connected to the internet, they become vulnerable to security breaches in the form of hacking, phishing, and so on. Frequent data leaks from social media already raise serious concerns about information security standards; if the IoT is to become the next new reality, serious efforts to create strict security standards need to be prioritized.
Company Datasets for valuable business insights!
Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.
These datasets are sourced from top industry providers, ensuring you have access to high-quality information:
We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:
You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.
Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.
With Oxylabs Datasets, you can count on:
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data provides the basis for the report titled "Ready for integrated sustainable agricultural land management? Are practitioners in archaeology and agriculture informed, willing, enabled, and motivated to change how they work with remote and near-surface sensing data to collaboratively address contemporary challenges in sustainable agricultural land management?"
Data were collected in compliance with the University of Glasgow's Research Ethics Policy (Application #100200154).
As stated in the Methods section of this report:
"The participatory survey was conducted between May 2021 and October 2022.
Location: The preponderance of stakeholders engaged with are professional practitioners or researchers based in the UK, Belgium, Italy, Cyprus, Spain and France. Sessions occurred remotely (online/phone), as well as on site, during workshops at the University of Glasgow, the Dalswinton Estate, Dumfries, and Manor Farm, Yedingham.
Participants
Selection: A sub-group of 51 high-level participants was selected from a greater network of 86 stakeholders who were engaged with during the ipaast project.
Sector: Farmers, researchers, heritage managers, geophysicists, remote sensing specialists, statisticians, soil scientists, service providers, sensor developers, and data archivists, who all deal directly or indirectly with datasets relating to the measurement of soil and/or plant properties (physical, chemical, microbial), were represented (Table 1).
Expertise: Engagement with mid- to late-career specialists was prioritised, with many participants having over 20 years of experience and most having over 10 years of experience (including time during the PhD).
Interview method
Engagement with stakeholders was primarily through one-to-one interviews and structured workshop discussions, conducted either in person or remotely over video conference or phone. In some instances, participants provided written input (see Table 2 summary). Follow-up interviews or written exchanges were used to clarify or continue discussions when required. A semi-structured approach to interviews and discussions was preferred, with a mix of general questions (see sample questions) as well as questions specifically tailored to each participant's specialist background and experience.
Sample Questions:
What types of sensing data do you use/collect?
Where/how do you access/collect these data?
What are your main aims/applications in using or collecting these data?
How often do you access/collect, or anticipate accessing/collecting, these data to be useful to you?
What spatial resolution is necessary for these data to be useful to you?
What, if anything, would encourage/discourage you from sharing your data?
What kinds of additional data types or additional information (metadata) might help you to better understand and use data which you have previously collected or received?
What do you see as the main impacts, if any, of ecosystem service frameworks and/or recent changes to rural/environmental regulations on your work?
What attitudes to sensing data do you see from other stakeholders in rural affairs?
Documentation: Where viable, interviews and workshop discussions were recorded and transcribed; alternatively, notes were made during engagement by either the interviewer and/or dedicated participant observers (e.g. at workshops). Where notes were used, specific quotes and summary reports were checked with the participants for accuracy. "
The Marshall Project, the nonprofit investigative newsroom dedicated to the U.S. criminal justice system, has partnered with The Associated Press to compile data on the prevalence of COVID-19 infection in prisons across the country. The Associated Press is sharing this data as the most comprehensive current national source of COVID-19 outbreaks in state and federal prisons.
Lawyers, criminal justice reform advocates and families of the incarcerated have worried about what was happening in prisons across the nation as coronavirus began to take hold in the communities outside. Data collected by The Marshall Project and AP shows that hundreds of thousands of prisoners, workers, correctional officers and staff have caught the illness as prisons became the center of some of the country’s largest outbreaks. And thousands of people — most of them incarcerated — have died.
In December, as COVID-19 cases spiked across the U.S., the news organizations also shared cumulative rates of infection among prison populations, to better gauge the total effects of the pandemic on prison populations. The analysis found that by mid-December, one in five state and federal prisoners in the United States had tested positive for the coronavirus -- a rate more than four times higher than the general population.
This data, which is updated weekly, is an effort to track how those people have been affected and where the crisis has hit the hardest.
The data tracks the number of COVID-19 tests administered to people incarcerated in all state and federal prisons, as well as the staff in those facilities. It is collected on a weekly basis by Marshall Project and AP reporters who contact each prison agency directly and verify published figures with officials.
Each week, the reporters ask every prison agency for the total number of coronavirus tests administered to its staff members and prisoners, the cumulative number who tested positive among staff and prisoners, and the numbers of deaths for each group.
The time series data is aggregated to the system level; there is one record for each prison agency on each date of collection. Not all departments could provide data for the exact date requested, so the data indicates the date each set of figures refers to.
To estimate the rate of infection among prisoners, we collected population data for each prison system before the pandemic, roughly in mid-March, in April, June, July, August, September and October. Beginning the week of July 28, we updated all prisoner population numbers, reflecting the number of incarcerated adults in state or federal prisons. Prior to that, population figures may have included additional populations, such as prisoners housed in other facilities, which were not captured in our COVID-19 data. In states with unified prison and jail systems, we include both detainees awaiting trial and sentenced prisoners.
To estimate the rate of infection among prison employees, we collected staffing numbers for each system. Where current data was not publicly available, we acquired other numbers through our reporting, including calling agencies or from state budget documents. In six states, we were unable to find recent staffing figures: Alaska, Hawaii, Kentucky, Maryland, Montana, Utah.
To calculate the cumulative COVID-19 impact on prisoner and prison worker populations, we aggregated prisoner and staff COVID case and death data up through Dec. 15. Because population snapshots do not account for movement in and out of prisons since March, and because many systems have significantly slowed the number of new people being sent to prison, it’s difficult to estimate the total number of people who have been held in a state system since March. To be conservative, we calculated our rates of infection using the largest prisoner population snapshots we had during this time period.
As with all COVID-19 data, our understanding of the spread and impact of the virus is limited by the availability of testing. Epidemiology and public health experts say that aside from a few states that have recently begun aggressively testing in prisons, it is likely that there are more cases of COVID-19 circulating undetected in facilities. Sixteen prison systems, including the Federal Bureau of Prisons, would not release information about how many prisoners they are testing.
Corrections departments in Indiana, Kansas, Montana, North Dakota and Wisconsin report coronavirus testing and case data for juvenile facilities; West Virginia reports figures for juvenile facilities and jails. For consistency of comparison with other state prison systems, we removed those facilities from our data that had been included prior to July 28. For these states we have also removed staff data. Similarly, Pennsylvania’s coronavirus data includes testing and cases for those who have been released on parole. We removed these tests and cases for prisoners from the data prior to July 28. The staff cases remain.
There are four tables in this data:
covid_prison_cases.csv
contains weekly time series data on tests, infections and deaths in prisons. The first dates in the table are on March 26. Any questions that a prison agency could not or would not answer are left blank.
prison_populations.csv
contains snapshots of the population of people incarcerated in each of these prison systems for whom data on COVID testing and cases are available. This varies by state and may not always be the entire number of people incarcerated in each system. In some states, it may include other populations, such as those on parole or held in state-run jails. This data is primarily for use in calculating rates of testing and infection, and we would not recommend using these numbers to compare the change in how many people are being held in each prison system.
staff_populations.csv
contains a one-time, recent snapshot of the headcount of workers for each prison agency, collected as close to April 15 as possible.
covid_prison_rates.csv
contains the rates of cases and deaths for prisoners. There is one row for every state and federal prison system and an additional row with the national totals.
The Associated Press and The Marshall Project have created several queries to help you use this data:
Get your state's prison COVID data: Provides each week's data from just your state and calculates a cases-per-100000-prisoners rate, a deaths-per-100000-prisoners rate, a cases-per-100000-workers rate and a deaths-per-100000-workers rate here
Rank all systems' most recent data by cases per 100,000 prisoners here
Find what percentage of your state's total cases and deaths -- as reported by Johns Hopkins University -- occurred within the prison system here
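If you prefer to compute such rates yourself, the sketch below follows the methodology described above (largest population snapshot per system, cases per 100,000 prisoners). The column names ("name", "as_of_date", "total_prisoner_cases", "pop") are assumptions and should be checked against the actual CSV headers.

```python
import pandas as pd

# Column names are assumptions; check the headers of the published CSV files.
cases = pd.read_csv("covid_prison_cases.csv")
cases["as_of_date"] = pd.to_datetime(cases["as_of_date"])
pops = pd.read_csv("prison_populations.csv")

# Keep the most recent weekly record for each prison system.
latest = cases.sort_values("as_of_date").groupby("name").tail(1)

# Use the largest population snapshot per system, as described in the methodology.
max_pop = pops.groupby("name")["pop"].max().rename("max_population").reset_index()

rates = latest.merge(max_pop, on="name")
rates["cases_per_100k"] = rates["total_prisoner_cases"] / rates["max_population"] * 100_000
print(rates.sort_values("cases_per_100k", ascending=False)[["name", "cases_per_100k"]].head(10))
```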
In stories, attribute this data to: “According to an analysis of state prison cases by The Marshall Project, a nonprofit investigative newsroom dedicated to the U.S. criminal justice system, and The Associated Press.”
Many reporters and editors at The Marshall Project and The Associated Press contributed to this data, including: Katie Park, Tom Meagher, Weihua Li, Gabe Isman, Cary Aspinwall, Keri Blakinger, Jake Bleiberg, Andrew R. Calderón, Maurice Chammah, Andrew DeMillo, Eli Hager, Jamiles Lartey, Claudia Lauer, Nicole Lewis, Humera Lodhi, Colleen Long, Joseph Neff, Michelle Pitcher, Alysia Santo, Beth Schwartzapfel, Damini Sharma, Colleen Slevin, Christie Thompson, Abbie VanSickle, Adria Watson, Andrew Welsh-Huggins.
If you have questions about the data, please email The Marshall Project at info+covidtracker@themarshallproject.org or file a Github issue.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
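For instance, a minimal sketch is shown below; the file name is a placeholder, so substitute the actual daily or hourly CSV you downloaded.

```python
import pandas as pd

# Placeholder file name; replace it with the actual daily or hourly CSV
# provided with the LifeSnaps release.
daily = pd.read_csv("lifesnaps_fitbit_daily.csv")

print(daily.shape)    # number of rows and columns
print(daily.dtypes)   # data type of each column
print(daily.head())   # first few records
```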
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
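Once the collections are restored, they can be queried from any MongoDB client. Below is a minimal sketch using pymongo, with the same host and port as the mongorestore commands above; add the username/password if access control is enabled.

```python
from pymongo import MongoClient

# Connect to the locally restored database (same host and port as the
# mongorestore commands above).
client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# Count documents in each collection and inspect one Fitbit document.
for name in ("fitbit", "sema", "surveys"):
    print(name, db[name].count_documents({}))

print(db["fitbit"].find_one())
```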
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 Cases by Population Characteristics Over Time’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/a3291d85-0076-43c5-a59c-df49480cdc6d on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Note: On January 22, 2022, system updates to improve the timeliness and accuracy of San Francisco COVID-19 cases and deaths data were implemented. You might see some fluctuations in historic data as a result of this change. Due to the changes, starting on January 22, 2022, the number of new cases reported daily will be higher than under the old system as cases that would have taken longer to process will be reported earlier.
A. SUMMARY This dataset shows San Francisco COVID-19 cases by population characteristics and by specimen collection date. Cases are included on the date the positive test was collected.
Population characteristics are subgroups, or demographic cross-sections, like age, race, or gender. The City tracks how cases have been distributed among different subgroups. This information can reveal trends and disparities among groups.
Data is lagged by five days, meaning the most recent specimen collection date included is 5 days prior to today. Tests take time to process and report, so more recent data is less reliable.
B. HOW THE DATASET IS CREATED Data on the population characteristics of COVID-19 cases and deaths are from: * Case interviews * Laboratories * Medical providers
These multiple streams of data are merged, deduplicated, and undergo data verification processes. This data may not be immediately available for recently reported cases because of the time needed to process tests and validate cases. Daily case totals on previous days may increase or decrease. Learn more.
Data are continually updated to maximize completeness of information and reporting on San Francisco residents with COVID-19.
Data notes on each population characteristic type are listed below.
Race/ethnicity * We include all race/ethnicity categories that are collected for COVID-19 cases. * The population estimates for the "Other" or “Multi-racial” groups should be considered with caution. The Census definition is likely not exactly aligned with how the City collects this data. For that reason, we do not recommend calculating population rates for these groups.
Sexual orientation * Sexual orientation data is collected from individuals who are 18 years old or older. These individuals can choose whether to provide this information during case interviews. Learn more about our data collection guidelines. * The City began asking for this information on April 28, 2020.
Gender * The City collects information on gender identity using these guidelines.
Comorbidities * Underlying conditions are reported when a person has one or more underlying health conditions at the time of diagnosis or death.
Transmission type * Information on transmission of COVID-19 is based on case interviews with individuals who have a confirmed positive test. Individuals are asked if they have been in close contact with a known COVID-19 case. If they answer yes, transmission category is recorded as contact with a known case. If they report no contact with a known case, transmission category is recorded as community transmission. If the case is not interviewed or was not asked the question, they are counted as unknown.
Homelessness
Persons are identified as homeless based on several data sources:
* self-reported living situation
* the location at the time of testing
* Department of Public Health homelessness and health databases
* Residents in Single-Room Occupancy hotels are not included in these figures.
These methods serve as an estimate of persons experiencing homelessness. They may not meet other homelessness definitions.
Skilled Nursing Facility (SNF) occupancy * A Skilled Nursing
--- Original source retains full ownership of the source dataset ---
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, and in terms of the students' emotional response. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires used to assess the emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.
Dataset
The artifact contains the resources described below.
Experiment resources
The resources needed for replicating the experiment, namely in directory experiment:
alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was provided in Portuguese due to the population of the experiment.
alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.
docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.
api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.
Experiment data
The task database used in our application of the experiment, namely in directory data/experiment:
Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.
identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.
Collected data
Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files (the latter with a header row), namely in directory data/results:
data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).
data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:
participant identification: participant's unique identifier (ID);
socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference not to disclose).
data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);
detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.
data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID);
user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).
participants.txt: the list of participant identifiers that have registered for the experiment.
Analysis scripts
The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:
analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.
requirements.r: An R script to install the required libraries for the analysis script.
normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.
normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.
Dockerfile: Docker script to automate the analysis script from the collected data.
Setup
To replicate the experiment and the analysis of the results, only Docker is required.
If you wish to manually replicate the experiment and collect your own data, you'll need to install:
A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.
If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:
Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.
R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.
Usage
Experiment replication
This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.
To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.
cd experiment
docker-compose up
This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.
In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:
Group N (no hints): http://localhost:3000/0CAN
Group L (error locations): http://localhost:3000/CA0L
Group E (counter-example): http://localhost:3000/350E
Group D (error description): http://localhost:3000/27AD
In the 2nd session, as in the 1st, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
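The permalink scheme described above can be summarised in a small sketch; the base URL and example identifier mirror the text, and this code is illustrative only, not part of the artifact itself.

```python
# Illustrative only: reconstructs the permalink scheme described in the text;
# it is not a script shipped with the artifact.
BASE_URL = "http://localhost:3000"
GROUPS = {"N": "no hints", "L": "error locations", "E": "counter-example", "D": "error description"}

def permalinks(identifier: str):
    group = GROUPS[identifier[-1]]            # treatment group is encoded in the last character
    session1 = f"{BASE_URL}/{identifier}"     # 1st session permalink
    session2 = f"{BASE_URL}/P-{identifier}"   # 2nd session permalink (prefixed with "P-")
    return group, session1, session2

print(permalinks("0CAN"))
# ('no hints', 'http://localhost:3000/0CAN', 'http://localhost:3000/P-0CAN')
```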
Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.
Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed on a 5-point Likert scale.
After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:
Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.
Analysis of other applications of the experiment
This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.
The analysis script expects data in 4 CSV files,
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Buzsaki Lab is proud to present a large selection of experimental data available for public access: https://buzsakilab.com/wp/database/. We publicly share more than a thousand sessions (about 40TB of raw and spike- and LFP-processed data) via our public data repository. The datasets are from freely moving rodents and include sleep-task-sleep sessions (3 to 24 hrs continuous recording sessions) in various brain structures, including metadata. We are happy to assist you in using the data. Our goal is that by sharing these data, other users can provide new insights, extend, contradict, or clarify our conclusions.
The databank contains electrophysiological recordings performed in freely moving rats and mice collected by investigators in the Buzsaki Lab over several years (a subset from head-fixed mice). Sessions have been collected with extracellular electrodes using high-channel-count silicon probes, with spike sorted single units, and intracellular and juxtacellular combined with extracellular electrodes. Several sessions include physiologically and optogenetically identified units. The sessions have been collected from various brain region pairs: the hippocampus, thalamus, amygdala, post-subiculum, septal region, and the entorhinal cortex, and various neocortical regions. In most behavioral tasks, the animals performed spatial behaviors (linear mazes and open fields), preceded and followed by long sleep sessions. Brain state classification is provided.
Getting started
The top menu “Databank” serves as a navigational menu to the databank. The metadata describing the experiments is stored in a relational database which means that there are many entry points for exploring the data. The databank is organized by projects, animal subjects, and sessions.
Accessing and downloading the datasets
We share the data through two services: our public Globus.org endpoint and our webshare: buzsakilab.nyumc.org. A subset of the datasets is also available at CRCNS.org. If you have an interest in a dataset that is not listed or is lacking information, please contact us. We pledge to make our data available immediately after publication.
Support
For support, please use our Buzsaki Databank google group. If you have an interest in a dataset that is not listed or is lacking information, please send us a request. Feel free to contact us, if you need more details on a given dataset or if a dataset is missing.
How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just 2 hours and 16 minutes.

Global social media usage. Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to keep up with current events and stay in touch with friends.

Global impact of social media. Social media has a wide-reaching and significant impact not only on online activities but also on offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of this document is to accompany the public release of data collected from OpenCon 2015 applications.
Download & Technical Information
The data can be downloaded in CSV format from GitHub here: https://github.com/RightToResearch/OpenCon-2015-Application-Data. The file uses UTF-8 encoding, comma as field delimiter, quotation marks as text delimiter, and no byte order mark.
This data is released to the public for free and open use under a CC0 1.0 license. We have a couple of requests for anyone who uses the data. First, we’d love it if you would let us know what you are doing with it, and share back anything you develop with the OpenCon community (#opencon / @open_con ). Second, it would also be great if you would include a link to the OpenCon 2015 website (www.opencon2015.org) wherever the data is used. You are not obligated to do any of this, but we’d appreciate it!
Unique ID
This is a unique ID assigned to each applicant. Numbers were assigned using a random number generator.
Timestamp
This was the timestamp recorded by Google Forms. Timestamps are in EDT (Eastern U.S. Daylight Time). Note that the application process officially began at 1:00pm EDT on June 1 and ended at 6:00am EDT on June 23. Some applications have timestamps later than this date, and this is due to a variety of reasons including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form.
Gender
Mandatory. Choose one from list or fill-in other. Options provided: Male, Female, Other (fill in).
Country of Nationality
Mandatory. Choose one option from list.
Country of Residence
Mandatory. Choose one option from list.
What is your primary occupation?
Mandatory. Choose one from list or fill-in other. Options provided: Undergraduate student; Masters/professional student; PhD candidate; Faculty/teacher; Researcher (non-faculty); Librarian; Publisher; Professional advocate; Civil servant / government employee; Journalist; Doctor / medical professional; Lawyer; Other (fill in).
Select the option below that best describes your field of study or expertise
Mandatory. Choose one option from list.
What is your primary area of interest within OpenCon’s program areas?
Mandatory. Choose one option from list. Note: for the first approximately 24 hours the options were listed in this order: Open Access, Open Education, Open Data. After that point, we set the form to randomize the order, and noticed an immediate shift in the distribution of responses.
Are you currently engaged in activities to advance Open Access, Open Education, and/or Open Data?
Mandatory. Choose one option from list.
Are you planning to participate in any of the following events this year?
Optional. Choose all that apply from list. Multiple selections separated by semi-colon.
Do you have any of the following skills or interests?
Mandatory. Choose all that apply from list or fill-in other. Multiple selections separated by semi-colon. Options provided: Coding; Website Management / Design; Graphic Design; Video Editing; Community / Grassroots Organizing; Social Media Campaigns; Fundraising; Communications and Media; Blogging; Advocacy and Policy; Event Logistics; Volunteer Management; Research about OpenCon's Issue Areas; Other (fill-in).
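Because multiple selections are stored as semicolon-separated strings, a small pandas sketch like the following can tally them. The file name and the exact column heading are assumptions to be checked against the released CSV.

```python
import pandas as pd

# File name and column heading are assumptions; check the CSV in the GitHub
# repository for the exact spellings.
apps = pd.read_csv("OpenCon-2015-Application-Data.csv")

# Multiple selections are separated by semicolons: split, explode, and count.
skills = (
    apps["Do you have any of the following skills or interests?"]
    .dropna()
    .str.split(";")
    .explode()
    .str.strip()
)
print(skills.value_counts().head(10))
```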
This data consists of information collected from people who applied to attend OpenCon 2015. In the application form, questions that would be released as Open Data were marked with a caret (^) and applicants were asked to acknowledge before submitting the form that they understood that their responses to these questions would be released as such. The questions we released were selected to avoid any potentially sensitive personal information, and to minimize the chances that any individual applicant can be positively identified. Applications were formally collected during a 22 day period beginning on June 1, 2015 at 13:00 EDT and ending on June 23 at 06:00 EDT. Some applications have timestamps later than this date, and this is due to a variety of reasons including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form. Applications were collected using a Google Form embedded at http://www.opencon2015.org/attend, and the shortened bit.ly link http://bit.ly/AppsAreOpen was promoted through social media. The primary work we did to clean the data focused on identifying and eliminating duplicates. We removed all duplicate applications that had matching e-mail addresses and first and last names. We also identified a handful of other duplicates that used different e-mail addresses but were otherwise identical. In cases where duplicate applications contained any different information, we kept the information from the version with the most recent timestamp. We made a few minor adjustments in the country field for cases where the entry was obviously an error (for example, electing a country listed alphabetically above or below the one indicated elsewhere in the application). We also removed one potentially offensive comment (which did not contain an answer to the question) from the Gender field and replaced it with “Other.”
OpenCon 2015 is the student and early career academic professional conference on Open Access, Open Education, and Open Data and will be held on November 14-16, 2015 in Brussels, Belgium. It is organized by the Right to Research Coalition, SPARC (The Scholarly Publishing and Academic Resources Coalition), and an Organizing Committee of students and early career researchers from around the world. The meeting will convene students and early career academic professionals from around the world and serve as a powerful catalyst for projects led by the next generation to advance OpenCon's three focus areas—Open Access, Open Education, and Open Data. A unique aspect of OpenCon is that attendance at the conference is by application only, and the majority of participants who apply are awarded travel scholarships to attend. This model creates a unique conference environment where the most dedicated and impactful advocates can attend, regardless of where in the world they live or their access to travel funding. The purpose of the application process is to conduct these selections fairly. This year we were overwhelmed by the quantity and quality of applications received, and we hope that by sharing this data, we can better understand the OpenCon community and the state of student and early career participation in the Open Access, Open Education, and Open Data movements.
For inquires about the OpenCon 2015 Application data, please contact Nicole Allen at nicole@sparc.arl.org.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Login Data Set for Risk-Based Authentication
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., a cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce these results made on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in productive systems, e.g., intrusion detection systems.
Overview
The data set contains the following features related to each login attempt on the SSO:
Feature | Data Type | Description | Range or Example |
---|---|---|---|
IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
Country | String | Country derived from the IP address | US |
Region | String | Region derived from the IP address | New York |
City | String | City derived from the IP address | Rochester |
ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown)¹ |
User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |
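Given the size of the data set (>33M rows), reading it in chunks keeps memory usage manageable. Below is a minimal sketch with a placeholder file name and column spellings taken from the table above; verify both against the released files.

```python
import pandas as pd

# Placeholder file name; column names follow the feature table above and
# should be verified against the released data set.
cols = ["User ID", "Login Timestamp", "Country", "Device Type", "Is Account Takeover"]
reader = pd.read_csv("rba_login_data.csv", usecols=cols, chunksize=1_000_000)

total_logins = 0
takeovers = 0
for chunk in reader:
    total_logins += len(chunk)
    takeovers += int(chunk["Is Account Takeover"].sum())

print(f"{total_logins} login attempts, {takeovers} labelled as account takeovers")
```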
Data Creation
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and timely order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated by publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical as in the real data set.
The RTT was randomly drawn from the login success status and synthesized geolocation data. We did this to ensure that the RTTs are realistic ones.
Regarding the Data Values
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
ASNs with values >= 500,000
IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
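A small sketch for flagging these placeholder entries is shown below; the column names ("ASN", "IP Address") and file name are again assumptions based on the feature table above.

```python
import ipaddress
import pandas as pd

# Column names ("ASN", "IP Address") are assumed to match the feature table above.
def is_synthetic(row) -> bool:
    """Flag rows with placeholder ASNs (>= 500,000) or IP addresses inside
    the 10.0.0.0/8 range, as described above."""
    in_placeholder_range = ipaddress.ip_address(row["IP Address"]) in ipaddress.ip_network("10.0.0.0/8")
    return row["ASN"] >= 500_000 or in_placeholder_range

# Example usage on a small slice of the data set (placeholder file name).
sample = pd.read_csv("rba_login_data.csv", nrows=100_000)
sample["synthetic"] = sample.apply(is_synthetic, axis=1)
print(sample["synthetic"].value_counts())
```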
Study Reproduction
Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Ethics
By using the SSO service, the users agreed to the collection and evaluation of the data for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
Publication
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
ACM Transactions on Privacy and Security
Bibtex
@article{Wiefling_Pump_2022, author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi}, title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}}, journal = {{ACM} {Transactions} on {Privacy} and {Security}}, doi = {10.1145/3546069}, publisher = {ACM}, year = {2022} }
License
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
¹ A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
https://creativecommons.org/publicdomain/zero/1.0/
Imgur is an image hosting and sharing website founded in 2009. It became one of the most popular websites in the world, with approximately 250 million users. The website does not require registration, and anyone can browse its content; however, an account must be created in order to post. Imgur is famous for an event it created in 2013 in which members register to send and receive gifts from other members on the website. The event takes place around Christmas, and people share their gifts on the website by posting pictures of the process, or of what they received, under a specific tag. The data provided here covers two sections that I think are important for understanding certain patterns within the Imgur community: the Most Viral section and the Secret Santa tag.
I have participated twice in the Imgur Secret Santa event and have always found funny and interesting posts in its Most Viral section. With the help of the Kaggle community, I would like to identify trends in the data provided and perhaps compare the Secret Santa data with the Most Viral data.
There are two DataFrames included, and they are almost identical in their columns:
The first DataFrame covers Imgur Most Viral posts. It contains many of the posts that were labelled as viral by the Imgur community and team, using specific algorithms that track the number of likes and dislikes across multiple platforms. The posts might be videos, GIFs, pictures, or just text.
The second DataFrame covers the Imgur Secret Santa tag. Secret Santa is an annual Imgur tradition in which members can sign up to send gifts to and receive gifts from other members during the Christmas holiday. This DataFrame contains many of the posts that were tagged with Secret Santa by the Imgur community; the posts might be videos, GIFs, pictures, or just text. There is an is_most_viral column in this DataFrame that is not present in the Most Viral DataFrame, since all of the posts there are viral.
Feature | Type | Dataset | Description |
---|---|---|---|
account_id | object | Imgur_Viral/imgur_secret_santa | Unique Account ID per member |
comment_count | float64 | Imgur_Viral/imgur_secret_santa | Number of comments made in the post |
datetime | float64 | Imgur_Viral/imgur_secret_santa | Timestamp containing date and time details |
downs | float64 | Imgur_Viral/imgur_secret_santa | Number of dislikes for the post |
favorite_count | float64 | Imgur_Viral/imgur_secret_santa | Number of users who marked the post as a favourite |
id | object | Imgur_Viral/imgur_secret_santa | Unique post ID; different posts have different IDs, even if posted by the same member |
images_count | float64 | Imgur_Viral/imgur_secret_santa | Number of images included in the post |
points | float64 | Imgur_Viral/imgur_secret_santa | Each post will have calculated points based on (ups - downs) |
score | float64 | Imgur_Viral/imgur_secret_santa | Ticket number |
tags | object | Imgur_Viral/imgur_secret_santa | Tags are sub albums that the post will show under |
title | object | Imgur_Viral/imgur_secret_santa | Title of the post |
ups | float64 | Imgur_Viral/imgur_secret_santa | Number of likes for the post |
views | float64 | Imgur_Viral/imgur_secret_santa | Number of people that viewed the post |
is_most_viral | boolean | imgur_secret_santa | Whether or not the post is viral |
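As a starting point for working with the two DataFrames, here is a minimal sketch in pandas; the file names are hypothetical placeholders, while the column names follow the table above.

```python
import pandas as pd

# Hypothetical file names -- adjust to the actual files shipped with the dataset.
viral = pd.read_csv("imgur_most_viral.csv")
santa = pd.read_csv("imgur_secret_santa.csv")

# Sanity-check the documented relation points = ups - downs.
print((viral["points"] - (viral["ups"] - viral["downs"])).abs().max())

# Compare engagement between the two collections.
cols = ["views", "ups", "downs", "comment_count", "favorite_count"]
print(viral[cols].describe())
print(santa[cols].describe())

# Share of Secret Santa posts that were also marked as most viral.
print(santa["is_most_viral"].mean())
```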
I would like to thank Imgur for providing an API that made collecting data from its website easier. With their help, we might be able to better understand certain trends that emerge from its community.
There is no specific problem to solve with this data; it is just a fun way to explore and learn more about programming and analyzing data. I hope you enjoy playing with the data as much as I enjoyed collecting it and browsing the website.
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions in terms of infrastructure development and the availability of cheap mobile devices. In fact, most of social media's global growth is driven by the increasing usage of mobile devices. The mobile-first market of Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America had the highest average time spent per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
This data release presents zircon U-Pb LA-ICP-MS geochronology from igneous rocks exposed in a coastal exposure at Ursus Head, Iliamna C-2 Quadrangle. Sills and dikes exposed at Ursus Head are deformed within the Bruin Bay fault zone. Common dikes intruding Late Triassic Kamishak Formation strata composing the hanging wall of the Bruin Bay fault are cut by numerous small-scale, low-angle, bedding-parallel, and high-angle contractional faults. Sample 09MAW006A, collected from one of many dikes deformed in the fault zone, produced a unimodal distribution of zircon ages with a weighted mean of 206.92 +/- 0.96 Ma (2 sigma, n = 47) and an MSWD of 1.1. Sample 09BG020C, collected from a sill of more mafic composition, yielded few zircons and produced only eight ages with a weighted mean of 212.49 +/- 2.28 Ma (2 sigma) and an MSWD of 3.42. The accompanying report also presents 40Ar/39Ar geochronology results from rocks collected roughly 7 km southwest of Contact Point, Iliamna B-3 Quadrangle.
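For readers unfamiliar with the reported statistics, the sketch below shows how an inverse-variance weighted mean, its 2-sigma uncertainty, and the MSWD are conventionally computed from individual ages and their 1-sigma uncertainties; the input values are illustrative only, not the released measurements.

```python
import numpy as np

def weighted_mean_mswd(ages, sigmas):
    """Inverse-variance weighted mean, its 2-sigma uncertainty, and the MSWD."""
    ages = np.asarray(ages, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    weights = 1.0 / sigmas**2
    mean = np.sum(weights * ages) / np.sum(weights)
    two_sigma = 2.0 / np.sqrt(np.sum(weights))
    mswd = np.sum(weights * (ages - mean) ** 2) / (len(ages) - 1)
    return mean, two_sigma, mswd

# Illustrative zircon ages (Ma) with 1-sigma uncertainties -- not the actual data.
ages = [206.1, 207.3, 206.8, 207.0, 206.5]
sigmas = [1.2, 1.0, 1.1, 0.9, 1.3]
print(weighted_mean_mswd(ages, sigmas))
```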
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and the presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
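As a rough illustration of this kind of cell suppression (a sketch, not CDC's actual pipeline), the snippet below re-codes rare demographic combinations to NA while keeping the records; the column names and the threshold are assumptions based on the description above.

```python
import pandas as pd

def suppress_rare_combinations(df: pd.DataFrame, cols, threshold: int = 5) -> pd.DataFrame:
    """Re-code combinations of the given columns occurring fewer than `threshold` times to 'NA'.
    Records are never removed; only the identifying values are masked."""
    out = df.copy()
    group_sizes = out.groupby(cols, dropna=False)[cols[0]].transform("size")
    out.loc[group_sizes < threshold, cols] = "NA"
    return out

# Tiny illustrative example (threshold lowered to 2 so the effect is visible).
cases = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "age_group": ["20-29", "20-29", "80+", "20-29", "20-29", "80+"],
    "race_ethnicity": ["A", "A", "B", "A", "A", "B"],
})
print(suppress_rare_combinations(cases, ["sex", "age_group", "race_ethnicity"], threshold=2))
```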
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county.