30 datasets found

Most popular database management systems worldwide 2024
statista.com
Updated Aug 14, 2015
Cite
Statista (2015). Most popular database management systems worldwide 2024 [Dataset]. https://www.statista.com/statistics/809750/worldwide-popularity-ranking-database-management-systems/
Dataset updated
Aug 14, 2015
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jun 2024
Area covered
Worldwide
Description
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.

Database Management Systems

As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world’s growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Fantastic databases and where to find them: Web applications for researchers...
scielo.figshare.com
jpeg
Updated Jun 3, 2023
Cite
Gerda Cristal Villalba; Ursula Matte (2023). Fantastic databases and where to find them: Web applications for researchers in a rush [Dataset]. http://doi.org/10.6084/m9.figshare.20018091.v1
Available download formats: jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.20018091.v1
Dataset updated
Jun 3, 2023
Dataset provided by
SciELO (http://www.scielo.org/)
Authors
Gerda Cristal Villalba; Ursula Matte
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

Public databases are essential to the development of multi-omics resources. The amount of data created by biological technologies needs a systematic and organized form of storage that can be quickly accessed and managed. This is the objective of a biological database. Here, we present an overview of human databases with web applications. The databases and tools allow the search of biological sequences, genes and genomes, gene expression patterns, epigenetic variation, protein-protein interactions, variant frequency, regulatory elements, and comparative analysis between human and model organisms. Our goal is to provide an opportunity for exploring large datasets and analyzing the data for users with little or no programming skills. Public user-friendly web-based databases facilitate data mining and the search for information applicable to healthcare professionals. Besides, biological databases are essential to improve biomedical search sensitivity and efficiency and merge multiple datasets needed to share data and build global initiatives for the diagnosis, prognosis, and discovery of new treatments for genetic diseases. To show the databases at work, we present a case study using ACE2 as an example of a gene to be investigated. The analysis and the complete list of databases are available in the following website .
Leading big data vendors in 2014-2017, by revenue
statista.com
Updated May 23, 2022
Cite
Statista (2022). Leading big data vendors in 2014-2017, by revenue [Dataset]. https://www.statista.com/statistics/254271/big-data-revenue-by-leading-vendors/
Dataset updated
May 23, 2022
Dataset authored and provided by
Statista (http://statista.com/)
Area covered
Worldwide
Description
This statistic shows the revenues from the leading big data vendors from 2014 to 2017. In 2017, IBM generated around 2.66 billion U.S. dollars worth of revenue through big data services, software and hardware.
Student Skill Gap Analysis
data.mendeley.com
Updated Apr 28, 2025
Cite
Bindu Garg (2025). Student Skill Gap Analysis [Dataset]. http://doi.org/10.17632/rv6scbpd7v.1
Unique identifier
https://doi.org/10.17632/rv6scbpd7v.1
Dataset updated
Apr 28, 2025
Authors
Bindu Garg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is designed for skill gap analysis, focusing on evaluating the skill gap between students’ current skills and industry requirements. It provides insights into technical skills, soft skills, career interests, and challenges, helping in skill gap analysis to identify areas for improvement.

By leveraging this dataset, educators, recruiters, and researchers can conduct skill gap analysis to assess students’ job readiness and tailor training programs accordingly. It serves as a valuable resource for identifying skill deficiencies and skill gaps, improving career guidance, and enhancing curriculum design through targeted skill gap analysis.

The column descriptors are as follows:

Name - Student's full name.
email_id - Student's email address.
Year - The academic year the student is currently in (e.g., 1st Year, 2nd Year, etc.).
Current Course - The course the student is currently pursuing (e.g., B.Tech CSE, MBA, etc.).
Technical Skills - List of technical skills possessed by the student (e.g., Python, Data Analysis, Cloud Computing).
Programming Languages - Programming languages known by the student (e.g., Python, Java, C++).
Rating - Self-assessed rating of technical skills on a scale of 1 to 5.
Soft Skills - List of soft skills (e.g., Communication, Leadership, Teamwork).
Rating - Self-assessed rating of soft skills on a scale of 1 to 5.
Projects - Indicates whether the student has worked on any projects (Yes/No).
Career Interest - The student's preferred career path (e.g., Data Scientist, Software Engineer).
Challenges - Challenges faced while applying for jobs/internships (e.g., Lack of experience, Resume building issues).
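Because the column list names two separate "Rating" columns, a reader loading the file may want to tell them apart by position. The following is a minimal sketch of one way to do that, assuming the file is a CSV whose header matches the descriptors above; the sample rows, the renamed column labels ("Technical Rating", "Soft Rating"), and the helper name `load_rows` are hypothetical and are not taken from the dataset.

```python
import csv
import io
from statistics import mean

# Made-up rows for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """Name,email_id,Year,Current Course,Technical Skills,Programming Languages,Rating,Soft Skills,Rating,Projects,Career Interest,Challenges
Student A,a@example.com,2nd Year,B.Tech CSE,"Python, Data Analysis","Python, Java",4,"Communication, Teamwork",3,Yes,Data Scientist,Lack of experience
Student B,b@example.com,3rd Year,MBA,Cloud Computing,Python,2,Leadership,5,No,Software Engineer,Resume building issues
"""


def load_rows(text):
    """Parse the CSV text and give the two 'Rating' columns distinct names."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Assumption: the first 'Rating' is the technical one, the second the soft-skill one,
    # following the order of the column descriptors.
    rating_names = iter(["Technical Rating", "Soft Rating"])
    header = [next(rating_names) if name == "Rating" else name for name in header]
    return [dict(zip(header, row)) for row in reader]


rows = load_rows(SAMPLE_CSV)
avg_technical = mean(int(r["Technical Rating"]) for r in rows)
avg_soft = mean(int(r["Soft Rating"]) for r in rows)
print(avg_technical, avg_soft)
```

Renaming by position avoids silently overwriting one rating with the other, which is what a plain dictionary keyed on the raw header would do.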
Data from: Current and projected research data storage needs of Agricultural...
catalog.data.gov
agdatacommons.nal.usda.gov
+2 more
Updated Apr 21, 2025
+ more versions
Cite
Agricultural Research Service (2025). Current and projected research data storage needs of Agricultural Research Service researchers in 2016 [Dataset]. https://catalog.data.gov/dataset/current-and-projected-research-data-storage-needs-of-agricultural-research-service-researc-f33da
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service (https://www.ars.usda.gov/)
Description
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to themselves collate responses from their unit before reporting in the survey.
Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions.
File Name: Appendix A.pdf
Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here.
Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
File Name: Machine-readable survey response data.csv
Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This information is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey.
File Name: Data Storage Survey Data for public release.xlsx
Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
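The per-person storage rule stated in the methodology (high end of the reported range, divided by 1 for an individual response or by G for a group response) can be sketched as a small function. This is only an illustration of that rule as described; the function and parameter names are hypothetical and do not come from the survey files.

```python
def per_person_storage_tb(range_high_end_tb, group_size=1):
    """Return per-person storage need in TB.

    range_high_end_tb: high end of the storage range the respondent selected.
    group_size: G, the number of individuals covered by a group response
                (1 for an individual response).
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return range_high_end_tb / group_size


# An individual who selected a range topping out at 10 TB.
individual = per_person_storage_tb(10)

# A group response covering 4 researchers in a range topping out at 100 TB.
group = per_person_storage_tb(100, group_size=4)
print(individual, group)
```

Using the high end of each range gives an upper-bound style estimate, which is why the description treats Big Data users separately with reported or estimated values.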
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...
zenodo.org
data.niaid.nih.gov
+2 more
csv
Updated Jul 20, 2024
+ more versions
Cite
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.11711230
Dataset updated
Jul 20, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 15, 2024
Area covered
YouTube
Description
Please cite the following paper when using this dataset:

N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

Abstract

This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
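The abstract says each title and description was assigned to a positive, negative, or neutral class using VADER, but it does not state the cutoffs. As a hedged illustration, the sketch below maps a VADER-style compound score (a number between -1 and 1) to one of the three classes using the thresholds of ±0.05 that the VADER documentation suggests as typical; whether the dataset authors used these exact values is an assumption, and the function name is hypothetical.

```python
def sentiment_class(compound, threshold=0.05):
    """Map a compound sentiment score in [-1, 1] to a three-way class.

    The default threshold follows the typical values suggested in the VADER
    documentation; it is an assumption here, not a detail taken from the dataset.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores for three video titles.
for score in (0.62, -0.31, 0.01):
    print(score, sentiment_class(score))
```

In practice the compound score would come from a sentiment analyzer run on the title or description text; the mapping step is kept separate here so it can be checked without any third-party package.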
Data from: Edtech in Higher Education: Focus Groups, Database, and Documents...
beta.ukdataservice.ac.uk
Updated 2023
Cite
UK Data Service (2023). Edtech in Higher Education: Focus Groups, Database, and Documents on Edtech Companies, Investors and Universities, 2021-2023 [Dataset]. http://doi.org/10.5255/ukda-sn-856729
Unique identifier
https://doi.org/10.5255/ukda-sn-856729
Dataset updated
2023
Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
datacite
Description
These data were generated as part of a two-and-a-half-year ESRC-funded research project examining the digitalisation of higher education (HE) and the educational technology (Edtech) industry in HE. Building on a theoretical lens of assetisation, it focused on forms of value in the sector, and governance challenges of digital data. It followed three groups of actors: UK universities, Edtech companies, and investors in Edtech. The researchers first sought to develop an overview of the Edtech industry in HE by building three databases on Edtech companies, investors in Edtech, and investment deals, using data downloaded from Crunchbase, a proprietary platform. Due to Crunchbase’s Terms of Service, only parts of one database are allowed to be submitted to this repository, i.e. a list of companies with the project’s classification. A report offering descriptive analysis of all three databases was produced and is submitted as well. A qualitative discursive analysis was conducted by analysing seven documents in depth. In the second phase, researchers conducted interviews with participants representing three groups of actors (n=43) and collected documents on their organisations. Moreover, documents from Big Tech (Microsoft, Amazon, and Salesforce) were collected to contextualise the role of global digital infrastructure in HE. Due to commercial sensitivity, only lists of documents collected about investors and Big Tech are submitted to the repository. Researchers then conducted focus groups (n=6) with representatives of universities (n=19). The dataset includes transcripts of focus groups and outputs of writing by participants during the focus group. Finally, a public consultation was held via a survey, and 15 participants offered qualitative answers.
B2B Leads Database | 500M+ B2B Contact Profiles | 100M+ B2B Mobile Numbers |...
datarade.ai
.csv, .xls
Updated Feb 24, 2022
Cite
Lead for Business (2022). B2B Leads Database | 500M+ B2B Contact Profiles | 100M+ B2B Mobile Numbers | 100% Real-Time Verified Contact Data [Dataset]. https://datarade.ai/data-products/b2b-leads-database-b2b-contact-database-b2b-contact-direc-lead-for-business
Available download formats: .csv, .xls
Dataset updated
Feb 24, 2022
Dataset authored and provided by
Lead for Business
Area covered
Jersey, Finland, Martinique, Mozambique, Palestine, Trinidad and Tobago, Armenia, South Sudan, Northern Mariana Islands, Isle of Man
Description
• 500M B2B Contacts
• 35M Companies
• 20+ Data Points to Filter Your Leads
• 100M+ Contact Direct Dial and Mobile Numbers
• Lifetime Support Until You Are 100% Satisfied

We are the Best b2b database providers for high-performance sales teams. If you get a fake by any chance, you have nothing to do with them. Nothing is more frustrating than receiving useless data for which you have paid money.

Every 15 days, our devoted team updates our b2b leads database. In addition, we are always available to assist our clients with whatever data they are working with in order to ensure that our service meets their needs. We keep an eye on our b2b contact database to keep you informed and provide any assistance you require.

With our simple-to-use system and up-to-date B2B contact list, we hope to make your job easier. You’ll be able to filter your data at Lfbbd based on the industry you work in. For example, you can choose from real estate companies or just simply tap into the healthcare business. Our database is updated on a regular basis, and you will receive contact information as soon as possible.

Use our information to quickly locate new business clients, competitors, and suppliers. We’ve got your back, no matter what precise requirements you have.

We have over 500 million business-to-business contacts that you may segment based on your marketing and commercial goals. We don’t stop there; we’re always gathering leads from the right tool so you can reach out to a big database of your clients without worrying about email constraints.

Thanks to our database, you may create your own campaign and send as many email or automated messages as you want. We collect the most viable b2b database to help you go a long way, as we seek to increase your business and enhance your sales.

The majority of our clients choose us since we have competitive costs when compared to others. In this digital era, marketing is more advanced, and customers are less willing to pay more for a service that produces poor results.

That’s why we’ve devised the most effective b2b database strategy for your company. You can also tailor your database and pricing to meet your specific business requirements.

• Connect directly with the right decision-makers, using the most accurate database of emails and direct dials. Build a clean prospecting list that you can plug into your sales tools and generate new leads from, right away.
• Over 500 million business contacts worldwide.
• You can filter your targeted leads by 20+ criteria including job title, industry, location, revenue, technology, and more.
• Find the email addresses of the professionals you want to contact one by one or in bulk.
Database of patient reviews expressing dissatisfaction with the quality of...
zenodo.org
bin
Updated Apr 22, 2025
Cite
Irina Kalabikhina; Anton Kolotusha (2025). Database of patient reviews expressing dissatisfaction with the quality of medical services in Russia in 2012-2023 [Dataset]. http://doi.org/10.5281/zenodo.15257447
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15257447
Dataset updated
Apr 22, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Irina Kalabikhina; Anton Kolotusha
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Russia
Description
Data format and access

The database consists of full-text patient reviews, reflecting their dissatisfaction with healthcare quality. Materials in Russian have been posted in the «Review list» of the site infodoctor.ru. Publication period: July 2012 to August 2023. The database consists of 18,492 reviews covering 16 Russian cities with population of over one million. Data format: .xlsx.

Data access: 10.5281/zenodo.15257447

Data collection methodology

Because negative reviews may be more reliable than positive ones, the authors collected negative reviews from 16 Russian cities with a population of over one million, for which it was possible to collect representative samples (at least 1000 reviews for each city). We have extracted reviews from the one-star section of this site's guestbook, as they are reliably identified as negative. Duplicates were removed from the database. Personal data in comment texts have been replaced with "##########". The author's gender was determined manually based on his/her name or gender endings in the texts of reviews. Otherwise, we indicated "0" - gender cannot be determined.

For Moscow reviews, classification was carried out using manual markup methods - based on the majority of votes for the review class from 3 annotators (if at least one annotator indicated that it was impossible to determine, the review was classified as #N/A - impossible to clearly determine). For reviews from other cities, classification was made into 3 classes using machine learning methods based on logistic regression. The classification accuracy was 88%.
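The manual labelling rule for the Moscow reviews (majority vote of three annotators, with "#N/A" whenever any annotator could not decide) can be sketched as follows. This is an illustration of the rule as described, not the authors' code; the function name is hypothetical, and the handling of a three-way split with no majority is an assumption, since the description does not cover that case.

```python
from collections import Counter

UNDETERMINED = "#N/A"


def label_review(votes):
    """Combine three annotator votes into one class label.

    votes: a sequence of three labels, each "M", "O", "C", or "#N/A".
    """
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    # Rule from the description: a single "cannot determine" vote makes the review #N/A.
    if UNDETERMINED in votes:
        return UNDETERMINED
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: with no majority (all three votes differ) the review is treated as #N/A.
    return label if count >= 2 else UNDETERMINED


print(label_review(["M", "M", "O"]))
print(label_review(["M", "#N/A", "M"]))
```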

The medical specialties were distributed into large groups for the convenience of further analysis. The correspondence of medical specialties to large groups is presented in detail in Appendix 1.

Sample structure and description of variables

· CITY – the name of a city with a population of over a million (on a separate sheet – Moscow), the other 15 are Volgograd, Voronezh, Yekaterinburg, Kazan, Krasnodar, Krasnoyarsk, Nizhny Novgorod, Novosibirsk, Omsk, Perm, Rostov-on-Don, Samara, St. Petersburg, Ufa, Chelyabinsk

· TEXT – review text

· GENDER – gender of the review author (2 – female, 1 – male, 0 – cannot be determined)

· CLASS_1 – group of reasons for dissatisfaction with medical care (M – issues of medical content, O – issues of organizational support and economic aspect, C – mixed (combined) class, #N/A – cannot be clearly determined)[1]

· CLASS_2 – group of reasons for dissatisfaction with medical care (0 – issues of medical content, 1 – issues of organizational support and economic aspect, 2 – mixed (combined) class, #N/A – cannot be clearly determined)

· DAY – day of the month the review was posted

· MONTH – month the review was posted

· YEAR – year the review was posted

· DOCTOR_OR_CLINIC – what or who is the review dedicated to – the doctor or the clinic

· SPEC – physician specialty (for observations where the review is dedicated to the physician)

· GROUP_SPEC – a large group of a physician’s specialty

· ID – observation identifier
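The variable list gives CLASS_1 and CLASS_2 as letter and numeric encodings of the same grouping, and GENDER as a numeric code. A minimal sketch of decoding these codes, taken directly from the descriptions above, might look like this; the dictionary and function names are hypothetical.

```python
# Code tables copied from the variable descriptions.
CLASS_1_TO_CLASS_2 = {"M": 0, "O": 1, "C": 2}
CLASS_LABELS = {
    "M": "issues of medical content",
    "O": "issues of organizational support and economic aspect",
    "C": "mixed (combined) class",
}
GENDER_LABELS = {2: "female", 1: "male", 0: "cannot be determined"}

UNDETERMINED = "#N/A"


def to_numeric_class(class_1):
    """Convert a CLASS_1 letter code to its CLASS_2 numeric code; '#N/A' passes through."""
    if class_1 == UNDETERMINED:
        return UNDETERMINED
    return CLASS_1_TO_CLASS_2[class_1]


print(to_numeric_class("O"), CLASS_LABELS["O"], GENDER_LABELS[2])
```

Keeping "#N/A" as a pass-through value mirrors how both class columns are described, rather than converting it to a number that could be mistaken for a real class.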

Database application

The data are suitable for analyzing patient dissatisfaction trends with medical services in Russia over the period from July 2012 to August 2023. This dataset could be particularly useful for healthcare providers, policymakers, and researchers interested in understanding patient experiences and identifying areas for quality improvement in Russian healthcare. Some potential applications include:

Analyzing geographic patterns of patient complaints across different cities in Russia

Examining trends in patient dissatisfaction over time

Identifying common reasons for dissatisfaction with medical care

Comparing dissatisfaction levels between different medical specialties

Assessing gender differences in patient complaints

The database provides rich qualitative data through full-text review texts, allowing for in-depth analysis of patient experiences. The structured variables like city, date, doctor/clinic information, etc. enable quantitative analysis as well. This combination of qualitative and quantitative data makes it possible to gain a comprehensive understanding of patient dissatisfaction patterns in Russia's healthcare system over more than a decade.

For researchers specifically interested in healthcare quality issues, this dataset could serve as an important resource for studying patient experiences and outcomes in Russia's medical system. The longitudinal nature of the data (2012-2023) also allows for analysis of changes over time in patient satisfaction.

Overall, this database provides valuable insights into patient perceptions of healthcare quality that could inform policy decisions, quality improvement

[1] We divided the variable-indicator of the group of reasons for dissatisfaction with medical care into 2 options - with letter (CLASS_1) and numeric codes (CLASS_2) (for the convenience of possible use of data in the work)
List of ROAD user groups and their capabilities.
plos.figshare.com
bin
Updated Aug 1, 2023
Cite
Andrew W. Kandel; Christian Sommer; Zara Kanaeva; Michael Bolus; Angela A. Bruch; Claudia Groth; Miriam N. Haidle; Christine Hertler; Julia Heß; Maria Malina; Michael Märker; Volker Hochschild; Volker Mosbrugger; Friedemann Schrenk; Nicholas J. Conard (2023). List of ROAD user groups and their capabilities. [Dataset]. http://doi.org/10.1371/journal.pone.0289513.t001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0289513.t001
Dataset updated
Aug 1, 2023
Dataset provided by
PLOS ONE
Authors
Andrew W. Kandel; Christian Sommer; Zara Kanaeva; Michael Bolus; Angela A. Bruch; Claudia Groth; Miriam N. Haidle; Christine Hertler; Julia Heß; Maria Malina; Michael Märker; Volker Hochschild; Volker Mosbrugger; Friedemann Schrenk; Nicholas J. Conard
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Group 1 can access ROAD without a login, while groups 2–4 require a user id and password to log in.
Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Cite
Dataintelo (2025). Data Cleaning Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-cleaning-tools-market
Available download formats: pptx, pdf, csv
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Cleaning Tools Market Outlook

As of 2023, the global market size for data cleaning tools is estimated at $2.5 billion, with projections indicating that it will reach approximately $7.1 billion by 2032, reflecting a robust CAGR of 12.1% during the forecast period. This growth is primarily driven by the increasing importance of data quality in business intelligence and analytics workflows across various industries.
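The relationship between the start value, end value, and compound annual growth rate (CAGR) quoted above can be checked with the standard formula CAGR = (end / start)^(1 / years) - 1. The sketch below is a generic illustration of that formula; the function names are hypothetical, and the choice of nine years (2023 to 2032) is an assumption about how the report counts the forecast period.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding start_value at the given annual rate."""
    return start_value * (1 + rate) ** years


# Figures from the report: $2.5 billion (2023) to about $7.1 billion (2032).
# Assumption: the span is counted as 9 years.
implied_rate = cagr(2.5, 7.1, 9)
print(f"implied CAGR: {implied_rate:.1%}")
print(f"2.5 compounded at 12.1% for 9 years: {project(2.5, 0.121, 9):.2f}")
```

Under that nine-year assumption the implied rate comes out near 12%, close to the quoted figure; counting the period differently (for example from 2024) would shift the result.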

The growth of the data cleaning tools market can be attributed to several critical factors. Firstly, the exponential increase in data generation across industries necessitates efficient tools to manage data quality. Poor data quality can result in significant financial losses, inefficient business processes, and faulty decision-making. Organizations recognize the value of clean, accurate data in driving business insights and operational efficiency, thereby propelling the adoption of data cleaning tools. Additionally, regulatory requirements and compliance standards also push companies to maintain high data quality standards, further driving market growth.

Another significant growth factor is the rising adoption of AI and machine learning technologies. These advanced technologies rely heavily on high-quality data to deliver accurate results. Data cleaning tools play a crucial role in preparing datasets for AI and machine learning models, ensuring that the data is free from errors, inconsistencies, and redundancies. This surge in the use of AI and machine learning across various sectors like healthcare, finance, and retail is driving the demand for efficient data cleaning solutions.

The proliferation of big data analytics is another critical factor contributing to market growth. Big data analytics enables organizations to uncover hidden patterns, correlations, and insights from large datasets. However, the effectiveness of big data analytics is contingent upon the quality of the data being analyzed. Data cleaning tools help in sanitizing large datasets, making them suitable for analysis and thus enhancing the accuracy and reliability of analytics outcomes. This trend is expected to continue, fueling the demand for data cleaning tools.

In terms of regional growth, North America holds a dominant position in the data cleaning tools market. The region's strong technological infrastructure, coupled with the presence of major market players and a high adoption rate of advanced data management solutions, contributes to its leadership. However, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period. The rapid digitization of businesses, increasing investments in IT infrastructure, and a growing focus on data-driven decision-making are key factors driving the market in this region.

As organizations strive to maintain high data quality standards, the role of an Email List Cleaning Service becomes increasingly vital. These services ensure that email databases are free from invalid addresses, duplicates, and outdated information, thereby enhancing the effectiveness of marketing campaigns and communications. By leveraging sophisticated algorithms and validation techniques, email list cleaning services help businesses improve their email deliverability rates and reduce the risk of being flagged as spam. This not only optimizes marketing efforts but also protects the reputation of the sender. As a result, the demand for such services is expected to grow alongside the broader data cleaning tools market, as companies recognize the importance of maintaining clean and accurate contact lists.

Component Analysis

The data cleaning tools market can be segmented by component into software and services. The software segment encompasses various tools and platforms designed for data cleaning, while the services segment includes consultancy, implementation, and maintenance services provided by vendors.

The software segment holds the largest market share and is expected to continue leading during the forecast period. This dominance can be attributed to the increasing adoption of automated data cleaning solutions that offer high efficiency and accuracy. These software solutions are equipped with advanced algorithms and functionalities that can handle large volumes of data, identify errors, and correct them without manual intervention. The rising adoption of cloud-based data cleaning software further bolsters this segment, as it offers scalability and ease of
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation)...
paperswithcode.com
Updated Sep 24, 2024
Cite
Jinyang Li; Binyuan Hui; Ge Qu; Jiaxi Yang; Binhua Li; Bowen Li; Bailin Wang; Bowen Qin; Rongyu Cao; Ruiying Geng; Nan Huo; Xuanhe Zhou; Chenhao Ma; Guoliang Li; Kevin C. C. Chang; Fei Huang; Reynold Cheng; Yongbin Li (2024). BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) Dataset [Dataset]. https://paperswithcode.com/dataset/bird-sql
Dataset updated
Sep 24, 2024
Authors
Jinyang Li; Binyuan Hui; Ge Qu; Jiaxi Yang; Binhua Li; Bowen Li; Bailin Wang; Bowen Qin; Rongyu Cao; Ruiying Geng; Nan Huo; Xuanhe Zhou; Chenhao Ma; Guoliang Li; Kevin C. C. Chang; Fei Huang; Reynold Cheng; Yongbin Li
Description
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education.
List of the number of localities and assemblages entered in ROAD.
plos.figshare.com
bin
Updated Aug 1, 2023
Cite
Andrew W. Kandel; Christian Sommer; Zara Kanaeva; Michael Bolus; Angela A. Bruch; Claudia Groth; Miriam N. Haidle; Christine Hertler; Julia Heß; Maria Malina; Michael Märker; Volker Hochschild; Volker Mosbrugger; Friedemann Schrenk; Nicholas J. Conard (2023). List of the number of localities and assemblages entered in ROAD. [Dataset]. http://doi.org/10.1371/journal.pone.0289513.t002
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0289513.t002
Dataset updated
Aug 1, 2023
Dataset provided by
PLOS ONE
Authors
Andrew W. Kandel; Christian Sommer; Zara Kanaeva; Michael Bolus; Angela A. Bruch; Claudia Groth; Miriam N. Haidle; Christine Hertler; Julia Heß; Maria Malina; Michael Märker; Volker Hochschild; Volker Mosbrugger; Friedemann Schrenk; Nicholas J. Conard
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
List of the number of localities and assemblages entered in ROAD.
Location Data | 3.5M+ Points of Interest (POI) in US and Canada | Places...
datarade.ai
Updated Nov 14, 2022
Cite
Xtract (2022). Location Data | 3.5M+ Points of Interest (POI) in US and Canada | Places Data | Comprehensive Coverage [Dataset]. https://datarade.ai/data-products/poi-data-locations-data-us-and-canada-xtract
Available download formats: .json, .xml, .csv, .xls, .txt
Dataset updated
Nov 14, 2022
Dataset authored and provided by
Xtract
Area covered
Canada, United States
Description
Xtract.io's massive point-of-interest database represents a transformative resource for location intelligence across the United States and Canada. Big data analysts, market researchers, and strategic planners can utilize these comprehensive location insights to develop sophisticated market strategies, conduct advanced spatial analysis, and gain a deep understanding of regional geographical landscapes.

Point of Interest (POI) data, also known as places data, provides the exact location of buildings, stores, or specific places. It has become essential for businesses to make smarter, geography-driven decisions in today's competitive landscape.

LocationsXYZ, the POI data product from Xtract.io, offers a comprehensive database of 6 million locations across the US, UK, and Canada, spanning 11 diverse industries, including:

- Retail
- Restaurants
- Healthcare
- Automotive
- Public utilities (e.g., ATMs, park-and-ride locations)
- Shopping malls, and more

Why Choose LocationsXYZ?

At LocationsXYZ, we:

- Deliver POI data with 95% accuracy
- Refresh POIs every 30, 60, or 90 days to ensure the most recent information
- Create on-demand POI datasets tailored to your specific needs
- Handcraft boundaries (geofences) for locations to enhance accuracy
- Provide POI and polygon data in multiple file formats

Unlock the Power of POI Data

With our point-of-interest data, you can:

- Perform thorough market analyses
- Identify the best locations for new stores
- Gain insights into consumer behavior
- Achieve an edge with competitive intelligence

LocationsXYZ has empowered businesses with geospatial insights, helping them scale and make informed decisions. Join our growing list of satisfied customers and unlock your business's potential with our cutting-edge POI data.
TetrapodTraits Database
zenodo.org
csv, zip
Updated Oct 9, 2024
Cite
Mario R. Moura; Karoline Ceron; Jhonny J. M. Guedes; Rosana Chen-Zhao; Yanina Sica; Julie Hart; Wendy Dorman; Julia M. Portmann; Pamela Gonzalez-del-Pliego; Ajay Ranipeta; Alessandro Catenazzi; Fernanda Werneck; Luis Felipe Toledo; Nathan Upham; Joao F. R. Tonini; Timothy J. Colston; Robert Guralnick; Rauri C. K. Bowie; R. Alexander Pyron; Walter Jetz (2024). TetrapodTraits Database [Dataset]. http://doi.org/10.5281/zenodo.11303604
Available download formats: zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.11303604
Dataset updated
Oct 9, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mario R. Moura; Karoline Ceron; Jhonny J. M. Guedes; Rosana Chen-Zhao; Yanina Sica; Julie Hart; Wendy Dorman; Julia M. Portmann; Pamela Gonzalez-del-Pliego; Ajay Ranipeta; Alessandro Catenazzi; Fernanda Werneck; Luis Felipe Toledo; Nathan Upham; Joao F. R. Tonini; Timothy J. Colston; Robert Guralnick; Rauri C. K. Bowie; R. Alexander Pyron; Walter Jetz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

Tetrapods (amphibians, reptiles, birds and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biassed inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by non-random missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.

Additional Information: This work is an output of the VertLife project. To flag errors, provide updates, or leave other comments, please go to vertlife.org. We aim to develop the database into a living resource at vertlife.org, and your feedback is essential to improve data quality and support community use.

Version 1.0.1 (25 May 2024). This minor release addresses a spelling error in the file Tetrapod_360.csv. The error involves replacing white-space characters with underscore characters in the field Scientific.Name to match the spelling used in the file TetrapodTraits_1.0.0.csv. These corrections affect only 102 species considered extinct and 13 domestic species (Bos_frontalis, Bos_grunniens, Bos_indicus, Bos_taurus, Camelus_bactrianus, Camelus_dromedarius, Capra_hircus, Cavia_porcellus, Equus_caballus, Felis_catus, Lama_glama, Ovis_aries, Vicugna_pacos). All extinct and domestic species in TetrapodTraits have their binomial names separated by underscore symbols instead of white space. Additionally, we have added the file GridCellShapefile.zip, which contains the shapefile required to map species presence across the 110 × 110 km equal area grid cells (this file was previously provided through an External Source here).
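
The correction described in the Version 1.0.1 note (white space replaced by underscores in Scientific.Name) can be illustrated with a short sketch. This is not the authors' script: the helper name `normalize_scientific_name` and the sample list are assumptions made for illustration, and only the species names are taken from the release note.

```python
import re


def normalize_scientific_name(name: str) -> str:
    """Join the parts of a binomial name with underscores instead of white space."""
    # Collapse any run of white space so "Bos  taurus" and "Bos taurus" agree.
    return re.sub(r"\s+", "_", name.strip())


# Hypothetical input values; the species names come from the release note above.
names_360 = ["Bos taurus", "Felis catus", "Lama_glama"]
normalized = [normalize_scientific_name(n) for n in names_360]
```

Names that already use underscores pass through unchanged, so a helper like this could be applied to a whole column before matching records between the two files.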

Version 1.0.0 (19 April 2024). TetrapodTraits, the full phylogenetically coherent database we developed, is being made publicly available to support a range of research applications in ecology, evolution, and conservation and to help minimise the impacts of biassed data in this model system. The database includes 24 species-level attributes linked to their respective sources across 33,281 tetrapod species. Specific fields clearly label data sources and imputations in the TetrapodTraits, while additional tables record the 10K values per missing entry per species.

Taxonomy – includes 8 attributes that inform scientific names and respective higher-level taxonomic ranks, authority name, and year of species description. Field names: Scientific.Name, Genus, Family, Suborder, Order, Class, Authority, and YearOfDescription.

Phylogenetic tree – includes 2 attributes that indicate which fully-sampled phylogeny contains the species, along with whether the species placement was imputed or not in the phylogeny. Field names: TreeTaxon, TreeImputed.

Body size – includes 7 attributes that inform length, mass, and data sources on species sizes, and details on the imputation of species length or mass. Field names: BodyLength_mm, LengthMeasure, ImputedLength, SourceBodyLength, BodyMass_g, ImputedMass, SourceBodyMass.

Activity time – includes 5 attributes that describe period of activity (e.g., diurnal, nocturnal) as dummy (binary) variables, data sources, details on the imputation of species activity time, and a nocturnality score. Field names: Diu, Noc, ImputedActTime, SourceActTime, Nocturnality.

Microhabitat – includes 8 attributes covering habitat use (e.g., fossorial, terrestrial, aquatic, arboreal, aerial) as dummy (binary) variables, data sources, details on the imputation of microhabitat, and a verticality score. Field names: Fos, Ter, Aqu, Arb, Aer, ImputedHabitat, SourceHabitat, Verticality.

Macrohabitat – includes 19 attributes that reflect major habitat types according to the IUCN classification, the sum of major habitats, data source, and details on the imputation of macrohabitat. Field names: MajorHabitat_1 to MajorHabitat_10, MajorHabitat_12 to MajorHabitat_17, MajorHabitatSum, ImputedMajorHabitat, SourceMajorHabitat. MajorHabitat_11, representing the marine deep ocean floor (unoccupied by any species in our database), is not included here.

Ecosystem – includes 6 attributes covering species ecosystem (e.g., terrestrial, freshwater, marine) as dummy (binary) variables, the sum of ecosystem types, data sources, and details on the imputation of ecosystem. Field names: EcoTer, EcoFresh, EcoMar, EcosystemSum, ImputedEcosystem, SourceEcosystem.

Threat status – includes 3 attributes that inform the assessed threat statuses according to IUCN red list and related literature. Field names: IUCN_Binomial, AssessedStatus, SourceStatus.

RangeSize – the number of 110 × 110 km grid cells covered by the species range map. Data derived from MOL.

Latitude – coordinate centroid of the species range map.

Longitude – coordinate centroid of the species range map.

Biogeography – includes 8 attributes that present the proportion of species range within each WWF biogeographical realm. Field names: Afrotropic, Australasia, IndoMalay, Nearctic, Neotropic, Oceania, Palearctic, Antarctic.

Insularity – includes 2 attributes that indicate whether a species is an insular endemic (binary, 1 = yes, 0 = no), followed by the respective data source. Field names: Insularity, SourceInsularity.

AnnuMeanTemp – Average within-range annual mean temperature (degrees Celsius). Data derived from CHELSA v. 1.2.

AnnuPrecip – Average within-range annual precipitation (mm). Data derived from CHELSA v. 1.2.

TempSeasonality – Average within-range temperature seasonality (Standard deviation × 100). Data derived from CHELSA v. 1.2.

PrecipSeasonality – Average within-range precipitation seasonality (Coefficient of Variation). Data derived from CHELSA v. 1.2.

Elevation – Average within-range elevation (metres). Data derived from topographic layers in EarthEnv.

ETA50K – Average within-range estimated time to travel to cities with a population >50K in the year 2015. Data from Nelson et al. (2019).

HumanDensity – Average within-range human population density in 2017. Data derived from HYDE v. 3.2.

PropUrbanArea – Proportion of species range map covered by built-up area, such as towns, cities, etc. at year 2017. Data derived from HYDE v. 3.2.

PropCroplandArea – Proportion of species range map covered by cropland area, identical to FAO's category 'Arable land and permanent crops' at year 2017. Data derived from HYDE v. 3.2.

PropPastureArea – Proportion of species range map covered by pasture, defined as grazing land with an aridity index > 0.5, assumed to be more intensively managed (converted in climate models) at year 2017. Data derived from HYDE v. 3.2.

PropRangelandArea – Proportion of species range map covered by rangeland, defined as Grazing land with an aridity index < 0.5, assumed to be less or not managed (not converted in climate models) at year 2017. Data derived from HYDE v. 3.2.

File content

All files use UTF-8 encoding.

ImputedSets.zip – the phylogenetic multiple imputation framework applied to the TetrapodTraits database produced 10,000 imputed values per missing data entry (= 100 phylogenetic trees x 10 validation-folds x 10 multiple imputations). These imputations were specifically developed for four fundamental natural history traits: Body length, Body mass, Activity time, and Microhabitat. To facilitate the evaluation of each imputed value in a user-friendly format, we offer 10,000 tables containing both observed and imputed data for the 33,281 species in the TetrapodTraits database. Each table encompasses information about the four targeted natural history traits, along with designated fields (e.g., ImputedMass) that clearly indicate whether the trait value provided (e.g., BodyMass_g) corresponds to observed (e.g., ImputedMass = 0) or imputed (e.g., ImputedMass = 1)
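
As a rough illustration of the imputation flags described above, the sketch below reproduces the table count (100 × 10 × 10) and separates observed from imputed rows using the ImputedMass flag. The tiny inline table is made up, and the exact column layout of the released tables is an assumption; only the field names Scientific.Name, BodyMass_g, and ImputedMass come from the description.

```python
import csv
import io

# 100 phylogenetic trees x 10 validation folds x 10 multiple imputations.
N_TABLES = 100 * 10 * 10

# Made-up rows for illustration; the values are not taken from the database.
SAMPLE = """Scientific.Name,BodyMass_g,ImputedMass
Species_a,12.5,0
Species_b,40.0,1
Species_c,3.2,0
"""


def split_by_imputation(rows, flag_field="ImputedMass"):
    """Separate observed rows (flag = 0) from imputed rows (flag = 1)."""
    observed, imputed = [], []
    for row in rows:
        target = imputed if int(row[flag_field]) == 1 else observed
        target.append(row)
    return observed, imputed


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
observed, imputed = split_by_imputation(rows)
```

The same function could be pointed at the other flags named in the field lists (for example ImputedLength) by changing `flag_field`.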
Data from: neurodata
neuinfo.org
dknet.org
Updated Oct 16, 2019
Cite
(2019). neurodata [Dataset]. http://identifiers.org/RRID:SCR_014264/resolver/mentions
Unique identifier
https://identifiers.org/RRID:SCR_014264 https://identifiers.org/RRID:SCR_014264/resolver/mentions
Dataset updated
Oct 16, 2019
Description
Project portal dedicated to understanding animal and machine intelligence, and a repository of data and tools. Suite of tools to analyze and graph imaging data. Image and data repository for large, publicly available neuro-specific data files and images. Contains tools for analytics, databases, cloud computing, and Web services applied to both big neuroimages and big neurographs.
Literature on Cloud Capacity Planning
data.niaid.nih.gov
Updated Aug 18, 2020
Cite
Andreadis, Georgios (2020). Literature on Cloud Capacity Planning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3989101
Dataset updated
Aug 18, 2020
Dataset authored and provided by
Andreadis, Georgios
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This release captures the state-of-the-art (as of 2020) in cloud capacity planning literature and provides a set of complementary scripts to analyze this literature. The dataset which is central to this release (publications.yaml) maps 57 cloud capacity planning approaches as published in literature to the taxonomy on cloud capacity planning which the authors of this release have proposed. The approaches were gathered with a systematic literature survey process, aggregating multiple common sources and executing a set of automated and manual filtering steps.

Taxonomy

The taxonomy and the process used to derive it is described in detail in the MSc Thesis of Georgios Andreadis at Delft University of Technology (to be published end of August 2020), on cloud capacity planning. We describe the taxonomy here to provide context to the raw data.

The taxonomy divides the process underlying capacity planning systems into the following categories:

- System Model
  - Workloads
  - Resources
- Model Inputs
- Forecast Model
  - Modeling Strategy
  - Model Structure
- Decision Support
  - Role
  - Type of Advice
  - Advice Method

For each of these categories, the taxonomy prescribes a set of possible classes (possible instantiations of the category). We list these for each category below, each preceded by its abbreviation as it appears in the dataset:

- System Model
  - Workloads
    - VM: Virtual Machines
    - DB: Databases
    - S: Streaming Workloads
    - BD: Big Data Frameworks
    - WS: Web Service
    - B: Batch Jobs
  - Resources
    - C: Compute Hardware
    - S: Storage Hardware
    - N: Network Hardware
    - E: Energy Hardware (Storage and Supply)
    - H: Heat Control Hardware
    - V: Virtualized Resources (VM, containers, etc.)
- Model Inputs
  - H: Historical Data
  - RS: Resource Specifications
  - B: (Micro)Benchmarks or Systematic Performance Tests
  - S: SLAs
  - P: Pricing Data
  - LC: Lease Contracts
  - HP: Human Personnel-related Factors
- Forecast Model
  - Modeling Strategy
    - A: Analytical
    - S: Simulation
    - E: Real-world Experimentation
  - Model Structure
    - U: Unconditional Extrapolation
    - W: What-if Scenarios
- Decision Support
  - Role
    - F: Forecast
    - A: Adaptation Advice
  - Type of Advice
    - N: Number of Resources
    - T: Type of Resources
    - L: Locality of Resources
  - Advice Method
    - H: Heuristic
    - R: Regression
    - L: Local Search
    - SS: Stochastic Search
    - SP: Stochastic Programming
    - NN: Neural Network
    - GT: Game Theory
    - GA: Genetic Algorithm
    - NLP: (Non)Linear Programming

File Structure

This release is structured as follows:

publications.yaml: This is the dataset of mappings of publications to the taxonomy. Each item in the array represents a publication, with a set of true-false classifications per category for each class.

The id field of each publication identifies the publication (first-author and publication year).

The summary field of each publication summarizes the publication in a short sentence.

The classification field contains a set of true-false classifications per category for each class.

The notes field is an optional field containing any additional notes kept by the author of this dataset on their classification, in the case where doubts arose during the classification process.

taxonomy.py: Script which parses the YAML dataset into different CSV views per category, to facilitate meta-analysis. Also prints out a full (long-table) representation of the mappings.

taxonomy_analysis.py: Jupyter notebook which contains several meta-analysis processing steps, including trend, cluster, and correlation analysis.

README.md: A file containing this description.
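
To make the record layout described above more concrete, here is a minimal sketch of one publication entry and of flattening its true-false classifications into a long table. It is not the release's taxonomy.py: the record is invented, and the key names inside `classification` (such as `workloads` and `modeling_strategy`) are assumptions, since the description does not spell them out. A plain Python dict stands in for the parsed YAML.

```python
# Hypothetical record mirroring the described fields (id, summary, classification);
# the category keys are assumed, not copied from publications.yaml.
publications = [
    {
        "id": "example-2020",
        "summary": "Illustrative entry, not a real publication.",
        "classification": {
            "workloads": {"VM": True, "DB": False},
            "modeling_strategy": {"A": False, "S": True},
        },
    }
]


def to_long_table(pubs):
    """Flatten nested classifications into (id, category, class, value) rows."""
    rows = []
    for pub in pubs:
        for category, classes in pub["classification"].items():
            for cls, value in classes.items():
                rows.append((pub["id"], category, cls, value))
    return rows


def classes_for(pubs, category):
    """Map each publication id to the classes marked true in one category."""
    return {
        pub["id"]: [c for c, v in pub["classification"].get(category, {}).items() if v]
        for pub in pubs
    }


long_rows = to_long_table(publications)
```

A per-category view such as `classes_for(publications, "workloads")` is one plausible shape for the CSV views mentioned above, though the actual script may organize them differently.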
Adjacency list cancers
figshare.com
pdf
Updated Nov 2, 2016
Cite
Aparna Rai; Sarika Jalan (2016). Adjacency list cancers [Dataset]. http://doi.org/10.6084/m9.figshare.3498095.v2
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.3498095.v2
Dataset updated
Nov 2, 2016
Dataset provided by
figshare
Authors
Aparna Rai; Sarika Jalan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data for each network consist of two columns in the form of an adjacency list, which depicts the protein-protein interactions (PPIs). Each row pairs a protein with another protein it interacts with.

There are fourteen files: each cancer has two files, one for the normal state and one for the disease state.

Sample of the network:

Protein1 Protein 2

Protein 4 Protein 7

Protein 2 Protein 8

| | | |

Protein N Protein N-4

where N is the number of proteins in each network.

Note that the networks provided here are the complete networks, and they are disconnected.

Apart from these fourteen networks, we also provide their connected components. Because the fourteen networks are disconnected, some analyses (such as diameter and spectra) have to be carried out on the largest connected component(s) (LCC) of each network, which can be extracted with any standard algorithm.

We provide the details of the connected components for each of these fourteen networks.

· Please note that the networks corresponding to Breast, Oral, Ovarian, Colon and Prostate each consist of only one big connected component; that is, for these five diseases there are ten connected components, one each for the normal and disease states.

· Further, for Cervical and Lung, there is more than one big connected component.
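
The description notes that the largest connected component can be obtained with any standard algorithm. One possible sketch, using breadth-first search over a two-column edge list, is shown below. The function name and the sample edges are assumptions for illustration; they are not taken from the dataset files.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component of an undirected edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Made-up rows in the two-column layout described above.
edges = [("P1", "P2"), ("P2", "P8"), ("P4", "P7")]
lcc = largest_connected_component(edges)
```

For networks with several components of equal size, this sketch keeps the first one it encounters; an analysis that needs all of them would have to collect ties explicitly.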
Bluesky Social Dataset
zenodo.org
application/gzip, csv
Updated Jan 16, 2025
Cite
Andrea Failla; Giulio Rossetti (2025). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.14669616
Available download formats: application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.14669616
Dataset updated
Jan 16, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrea Failla; Giulio Rossetti
License
https://bsky.social/about/support/tos
Description
Bluesky Social Dataset

Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue.

The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

Dataset

Here is a description of the dataset files.

followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers representing a directed following relation (i.e., user u follows user v).

user_posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in a collection of files, each containing the post of an anonymized user. Each post is stored as a JSON-formatted line.

interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers representing a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.

graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.

feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files containing posts from one feed each. Each post is stored as a JSON-formatted line. Fields correspond to those in posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).

feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values: the feed name, user id, and timestamp.

feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.

scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
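
As an illustration of the follower edge list format described above (rows of two comma-separated integers, where user u follows user v), the sketch below counts followers per user from gzip-compressed CSV data. It assumes the file has no header row, which the description does not state, and it uses a few made-up edges compressed in memory in place of followers.csv.gz.

```python
import csv
import gzip
import io
from collections import Counter


def follower_counts(gz_bytes):
    """Count followers per user from a gzip-compressed 'u,v' edge list (u follows v)."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        for u, v in csv.reader(handle):
            # Each row is a directed edge, so v gains one follower.
            counts[int(v)] += 1
    return counts


# Made-up edges standing in for the real file contents.
sample = gzip.compress(b"1,2\n3,2\n2,1\n")
counts = follower_counts(sample)
```

For the real file, `gzip.open` can be given the file path directly instead of an in-memory buffer, which avoids loading the whole archive at once.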

Citation

If used for research purposes, please cite the following paper describing the dataset details:

Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year's Worth of Social Data." PlosOne (2024) https://doi.org/10.1371/journal.pone.0310330

Right to Erasure (Right to be forgotten)

Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.

Users included in the Bluesky Social dataset have the right to opt-out and request the removal of their data, per GDPR provisions (Article 17).

We emphasize that the released data has been thoroughly pseudonymized in compliance with GDPR (Article 4(5)). Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to protect individual privacy further and minimize reidentification risk. Moreover, it should be noted that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides opt-out derogations (Article 17(3)(d) and Article 89).

Nonetheless, if you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with the subject "Removal request: [username]"). We will process your request within a reasonable timeframe; updates will occur monthly, if necessary, and access to previous versions will be restricted.

Acknowledgments:

This work is supported by:

the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);

SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;

EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
California Natural Diversity Database
knb.ecoinformatics.org
Updated Sep 12, 2014
Cite
Landels-Hill Big Creek Reserve; University of California Natural Reserve System; Kurt Merg (2014). California Natural Diversity Database [Dataset]. http://doi.org/10.5063/AA/nrs.381.1
Unique identifier
https://doi.org/10.5063/AA/nrs.381.1
Dataset updated
Sep 12, 2014
Dataset provided by
Knowledge Network for Biocomplexity
Authors
Landels-Hill Big Creek Reserve; University of California Natural Reserve System; Kurt Merg
Time period covered
Jan 1, 2005
Description
This database receives data from many sources, including but not limited to the US Fish and Wildlife Service and the California Department of Fish and Game. It provides lists and information regarding rare and threatened animals, plants, and ecological communities. It uses scientific classification to identify plants and animals. It also ranks species according to how rare or endangered they are, both regionally and worldwide. Lists and reports are available on the website in PDF format. Other CNDDB data is contained in the CNDDB data link, which is password protected.
