Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Label quantities of the non-duplicate and duplicate entries compared to the original dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction in model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
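For illustration only (this is not code from the paper), the minimal R sketch below removes duplicate SMILES entries before assigning k-fold cross-validation folds, which is the vetting step the authors argue was missing; the file name and column names are assumptions.

```r
# Minimal sketch, assuming a table with columns 'smiles' and 'pathway_label'.
library(data.table)

kegg <- fread("kegg_smiles_pathways.csv")      # assumed file name
kegg <- unique(kegg, by = "smiles")            # drop duplicate SMILES entries

set.seed(42)
k <- 10
kegg[, fold := sample(rep(seq_len(k), length.out = .N))]  # assign CV folds

# Example: use fold 1 as the held-out test set
test_set  <- kegg[fold == 1]
train_set <- kegg[fold != 1]
```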
https://www.datainsightsmarket.com/privacy-policy
The global market for Document Duplication Detection Software is experiencing robust growth, driven by the increasing need for efficient data management and enhanced security across various industries. The rising volume of digital documents, coupled with stricter regulatory compliance requirements (like GDPR and CCPA), is fueling the demand for solutions that can quickly and accurately identify duplicate files. This reduces storage costs, improves data quality, and minimizes the risk of data breaches. The market's expansion is further propelled by advancements in artificial intelligence (AI) and machine learning (ML) technologies, which enable more sophisticated and accurate duplicate detection. We estimate the current market size to be around $800 million in 2025, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is expected across various segments, including cloud-based and on-premise solutions, catering to diverse industry verticals such as legal, finance, healthcare, and government. Major players like Microsoft, IBM, and Oracle are contributing to market growth through their established enterprise solutions. However, the market also features several specialized players, like Hyper Labs and Auslogics, offering niche solutions catering to specific needs. While the increasing adoption of cloud-based solutions is a key trend, potential restraints include the initial investment costs for software implementation and the need for ongoing training and support. The integration challenges with existing systems and the potential for false positives can also impede wider adoption. The market's regional distribution is expected to see a significant contribution from North America and Europe, while the Asia-Pacific region is projected to exhibit substantial growth potential driven by increasing digitalization. The forecast period (2025-2033) presents significant opportunities for market expansion, driven by technological innovation and the growing awareness of data management best practices.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This upload contains dataset splits of SoundDesc [1] and other supporting material for our paper:
Data leakage in cross-modal retrieval training: A case study [arXiv] [ieeexplore]
In our paper, we demonstrated that a data leakage problem in the previously published splits of SoundDesc leads to overly optimistic retrieval results.
Using off-the-shelf audio fingerprinting software, we identified that the data leakage stems from duplicates in the dataset.
We define two new splits for the dataset: a cleaned split to remove the leakage and a group-filtered split to avoid other kinds of weak contamination of the test data.
SoundDesc is a dataset which was automatically sourced from the BBC Sound Effects web page [2]. The results from our paper can be reproduced using clean_split01 and group_filtered_split01.
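As a rough illustration of group-aware splitting (not the exact procedure used to build clean_split01 or group_filtered_split01), the R sketch below keeps every recording that shares a duplicate-group identifier on the same side of the split; the metadata file and column names are assumptions.

```r
# Minimal sketch, assuming a metadata table with columns 'item_id' and 'group_id',
# where items in the same group are duplicates or near-duplicates of each other.
library(data.table)

meta <- fread("sounddesc_metadata.csv")   # assumed file name

set.seed(1)
groups      <- unique(meta$group_id)
test_groups <- sample(groups, size = round(0.15 * length(groups)))

test_set  <- meta[group_id %in% test_groups]
train_set <- meta[!group_id %in% test_groups]

# Sanity check: no group appears in both splits
stopifnot(length(intersect(train_set$group_id, test_set$group_id)) == 0)
```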
If you use the splits, please cite our work:
Benno Weck, Xavier Serra, "Data Leakage in Cross-Modal Retrieval Training: A Case Study," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10094617.
@INPROCEEDINGS{10094617,
author={Weck, Benno and Serra, Xavier},
booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Data Leakage in Cross-Modal Retrieval Training: A Case Study},
year={2023},
volume={},
number={},
pages={1-5},
doi={10.1109/ICASSP49357.2023.10094617}}
References:
[1] A. S. Koepke, A. -M. Oncescu, J. Henriques, Z. Akata and S. Albanie, "Audio Retrieval with Natural Language Queries: A Benchmark Study," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2022.3149712.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Counts of unique entries according to number of occurrences and number of pathway labels.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Excel file "A Systematic Literature Review on Machine Learning Techniques for Predicting Household Water Consumption" is structured to document a systematic review process. It contains six sheets, each representing a stage in the literature review. Here's a breakdown of each sheet:
Sheet 1 (Initial Set): contains the raw collection of research articles from databases like Scopus. Includes fields such as title, author, journal, year, source, and abstract.
Sheet 2: similar to the Initial Set but with duplicate entries removed; retains the same structure and content fields.
Sheet 3: articles that passed the abstract screening stage; same format as above, suggesting relevance was judged based on abstract content.
Sheet 4: articles that passed a deeper screening phase; likely assessed on fuller content, with some articles having missing abstracts (None values present).
Sheet 5: evaluates articles based on 10 quality criteria (e.g., clarity of objectives, use of visuals, replicability); scores are numerical (0–1) and include calculated metrics such as Quality of the report, Credibility, Rigor, and Relevance.
Sheet 6: the most relevant and high-quality studies selected for the review; detailed columns include ML Technique (MLT), MLT Characteristic, Type of Evaluation, Selection Factors, Benefits and Challenges, Type of Publication, and DOI.
A minimal loading sketch for these sheets follows below.
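As a small, hedged sketch (the file name and sheet order are assumptions, not part of the upload), the R code below loads the Initial Set sheet and reproduces the deduplication stage by article title.

```r
# Minimal sketch, assuming sheet 1 is the Initial Set with a 'title' column.
library(readxl)

slr_file <- "household_water_consumption_slr.xlsx"   # assumed file name
initial  <- read_excel(slr_file, sheet = 1)

# Drop duplicate articles by case-insensitive title match
deduplicated <- initial[!duplicated(tolower(initial$title)), ]
nrow(initial) - nrow(deduplicated)                   # number of duplicates removed
```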
Our Cinematic Dataset is a carefully selected collection of audio files with rich metadata, providing a wealth of information for machine learning applications such as generative AI music, Music Information Retrieval (MIR), and source separation. This dataset is specifically created to capture the rich and expressive quality of cinematic music, making it an ideal training environment for AI models. This dataset, which includes chords, instrumentation, key, tempo, and timestamps, is an invaluable resource for those looking to push AI's bounds in the field of audio innovation.
Strings, brass, woodwinds, and percussion are among the instruments used in the orchestral ensemble, which is a staple of film music. Strings, including violins, cellos, and double basses, are vital for communicating emotion, while brass instruments, such as trumpets and trombones, contribute to vastness and passion. Woodwinds, such as flutes and clarinets, give texture and nuance, while percussion instruments bring rhythm and impact. The careful arrangement of these parts produces distinct cinematic soundscapes, making the genre excellent for teaching AI models to recognize and duplicate complicated musical patterns.
Training models on this dataset provides a unique opportunity to explore the complexities of cinematic composition. The dataset's emphasis on important cinematic components, along with cinematic music's natural emotional storytelling ability, provides a solid platform for AI models to learn and compose music that captures the essence of engaging storylines. As AI continues to push creative boundaries, this Cinematic Music Dataset is a valuable tool for anybody looking to harness the compelling power of music in the digital environment.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Ethics Reference No: 209113723/2023/1. Source code is available on GitHub, and the datasets used to reproduce the results are listed at: https://github.com/DHollenbach/record-linkage-and-deduplication/blob/main/README.md

Abstract: The research emphasised the vital role of a Master Patient Index (MPI) solution in addressing the challenges public healthcare facilities face in eliminating duplicate patient records and improving record linkage. The study recognised that traditional MPI systems may have limitations in terms of efficiency and accuracy. To address this, the study focused on utilising machine learning techniques to enhance the effectiveness of MPI systems, aiming to support the growing record linkage healthcare ecosystem. It was essential to highlight that integrating machine learning into MPI systems is crucial for optimising their capabilities. The study aimed to improve data linking and deduplication processes within MPI systems by leveraging machine learning techniques. This emphasis on machine learning represented a significant shift towards more sophisticated and intelligent healthcare technologies. Ultimately, the goal was to ensure safe and efficient patient care, benefiting individuals and the broader healthcare industry.

This research investigated the performance of five machine learning classification algorithms (random forests, extreme gradient boosting, logistic regression, stacking ensemble, and deep multilayer perceptron) for data linkage and deduplication on four datasets. These techniques improved data linking and deduplication for use in an MPI system.

The findings demonstrate the applicability of machine learning models for effective data linkage and deduplication of electronic health records. The random forest algorithm achieved the best performance (identifying duplicates correctly) based on accuracy, F1-score, and AUC score for three datasets (Electronic Practice-Based Research Network (ePBRN): Acc = 99.83%, F1-score = 81.09%, AUC = 99.98%; Freely Extensible Biomedical Record Linkage (FEBRL) 3: Acc = 99.55%, F1-score = 96.29%, AUC = 99.77%; Custom-synthetic: Acc = 99.98%, F1-score = 99.18%, AUC = 99.99%). In contrast, the experimentation on the FEBRL4 dataset revealed that the multi-layer perceptron artificial neural network (MLP-ANN) and logistic regression algorithms outperformed the random forest algorithm. The performance results for the MLP-ANN were (FEBRL4: Acc = 99.93%, F1-score = 96.95%, AUC = 99.97%). For the logistic regression algorithm, the results were (FEBRL4: Acc = 99.99%, F1-score = 96.91%, AUC = 99.97%).

In conclusion, the results of this research have significant implications for the healthcare industry, as they are expected to enhance the utilisation of MPI systems and improve their effectiveness in the record linkage healthcare ecosystem. By improving patient record linking and deduplication, healthcare providers can ensure safer and more efficient care, ultimately benefiting patients and the industry.
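The sketch below is a hedged illustration of the general approach (train a classifier on record-pair comparison features), not the study's code; the input file, feature names, and label column are assumptions.

```r
# Minimal sketch, assuming a table of candidate record pairs with numeric
# similarity features and a binary 'is_duplicate' label.
library(data.table)
library(randomForest)

pairs <- fread("comparison_vectors.csv")     # assumed file name and schema
pairs[, is_duplicate := factor(is_duplicate)]

set.seed(7)
idx   <- sample(nrow(pairs), size = round(0.8 * nrow(pairs)))
train <- pairs[idx]
test  <- pairs[-idx]

rf   <- randomForest(is_duplicate ~ ., data = train, ntree = 500)
pred <- predict(rf, newdata = test)
mean(pred == test$is_duplicate)              # simple hold-out accuracy
```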
https://www.datainsightsmarket.com/privacy-policy
The market for Duplicate File Finder and Remover Tools is experiencing robust growth, driven by the increasing volume of digital data generated and stored by both private and commercial users. The proliferation of devices (Windows, Mac, Android) and the ever-expanding storage capacities contribute to the problem of duplicate files cluttering systems and wasting valuable disk space. This leads to performance issues and the need for efficient solutions. While the exact market size in 2025 is unavailable, considering a reasonable CAGR of 15% (a conservative estimate given the technology's utility) from a 2019 baseline of $500 million (a plausible starting point given the established players), the market size could be estimated at approximately $1.2 Billion in 2025. The market is segmented by user type (private and commercial) and operating system (Windows, Mac, Android), with Windows currently holding the largest share due to its wider adoption. Key trends include the integration of AI and machine learning for improved accuracy and efficiency in detecting duplicates, and the rise of cloud-based solutions offering automated cleanup and centralized management. Restraints include the availability of free, basic tools and user reluctance to adopt paid software. However, the growing demand for advanced features such as file comparison, selective deletion, and data recovery capabilities fuels the growth of the premium segment. The competitive landscape is characterized by a mix of established players like Systweak Software, Auslogics Labs, and Piriform, alongside smaller, niche companies. These companies are focusing on differentiation through advanced features, intuitive user interfaces, and robust customer support. The market’s future growth will depend on factors such as ongoing technological advancements, increasing user awareness of duplicate file issues, and the expanding adoption of cloud storage and data management services. Geographic regions like North America and Europe are expected to maintain significant market shares due to higher technological adoption and digital literacy. However, the Asia-Pacific region presents significant growth potential, driven by increasing internet penetration and smartphone usage. Continued innovation and the development of user-friendly, highly effective solutions will be key to success in this expanding market.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://dataintelo.com/privacy-and-policy
The global market size for duplicate contact remover apps is poised to experience significant growth, with an estimated valuation of $1.2 billion in 2023, projected to reach $2.8 billion by 2032, reflecting a robust CAGR of 9.5%. The primary growth factors driving this market include the increased adoption of smartphones, the proliferation of digital communication platforms, and the rising demand for efficient contact management solutions to streamline personal and professional communication.
The growth of the duplicate contact remover apps market is propelled largely by the increasing penetration of smartphones across the globe. As smartphones become more integral to daily life, managing contacts efficiently is crucial for both individual and enterprise users. Duplicate contacts can cause confusion, hinder effective communication, and lead to data inconsistency. Hence, there is a growing need for applications that can automatically identify and remove redundant contact entries, ensuring a seamless user experience. Furthermore, the rise in digital communication tools and social media platforms, which often result in multiple entries for the same contact, also contributes to the demand for such apps.
Another significant growth driver is the increasing awareness and emphasis on data cleanliness and accuracy. In an era where data is considered the new oil, maintaining accurate and clean contact databases is vital for effective communication and business operations. Duplicate contacts can lead to miscommunication, missed opportunities, and inefficiencies in customer relationship management (CRM) systems. Businesses are increasingly recognizing the importance of maintaining a clean contact database for improved operational efficiency, driving the adoption of duplicate contact remover apps. Additionally, advancements in AI and machine learning technologies enhance the capabilities of these apps, making them more efficient in identifying and merging duplicate entries.
The surge in remote work and the digital transformation of businesses further fuel the need for effective contact management solutions. With employees working from various locations and relying heavily on digital communication tools, the chances of duplicate contacts increase. Duplicate contact remover apps enable organizations to maintain a unified and accurate contact database, facilitating better communication and collaboration among remote teams. Moreover, the integration of these apps with popular CRM systems and email platforms adds to their utility and adoption, making them an essential tool for modern businesses.
In the realm of innovative solutions for maintaining cleanliness and efficiency, the Automated Facade Contact Cleaning Robot emerges as a groundbreaking technology. This robot is designed to address the challenges associated with cleaning high-rise building facades, which are often difficult and dangerous to maintain manually. By utilizing advanced robotics and automation, these robots can navigate complex surfaces, ensuring thorough cleaning without the need for human intervention. This not only enhances safety but also significantly reduces the time and cost involved in facade maintenance. The integration of such automated solutions is becoming increasingly prevalent in urban environments, where maintaining the aesthetic and structural integrity of buildings is paramount. As cities continue to grow and evolve, the demand for automated cleaning solutions like the Automated Facade Contact Cleaning Robot is expected to rise, offering a glimpse into the future of building maintenance.
Regionally, North America and Europe are expected to lead the market, driven by high smartphone penetration, advanced digital infrastructure, and the presence of major technology companies. Asia Pacific, however, is projected to witness the highest growth rate during the forecast period, owing to the rapid adoption of smartphones, increasing internet penetration, and the growing emphasis on digitalization in emerging economies. The market in Latin America and the Middle East & Africa is also anticipated to grow steadily as awareness about the benefits of contact management solutions increases.
In the context of operating systems, the market for duplicate contact remover apps is segmented into Android, iOS, Windows, and others. The Android segment is expected to dominate the market due to the large global user base of Android devices.
https://www.archivemarketresearch.com/privacy-policy
Market Size and Growth: The global file copy software market was valued at USD XXX million in 2025 and is projected to grow at a CAGR of XX% from 2025 to 2033, reaching USD XXX million by 2033. This growth is attributed to the rising demand for efficient and reliable data transfer solutions, the proliferation of cloud-based services, and the increasing adoption of personal and enterprise-level file sharing.

Market Drivers and Trends: The key drivers of the file copy software market include the need for faster and more secure file transfers, the increasing volume of data being generated, and the growing popularity of collaborative workspaces. The market is also driven by the adoption of artificial intelligence (AI) and machine learning (ML) technologies, which enable file copy software to optimize performance and automate tasks. Other trends shaping the market include the emergence of cloud-native file copy solutions, the integration of file copy software with other business applications, and the growing awareness of data privacy and security regulations.
This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets that were provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. According to my understanding, through the EDA of the merged dataset, we will be able to get a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.
##### Data Description
Data Field Information (this is a copy of the description as provided in the actual dataset)
Train.csv
- id: store id
- date: date of the sale
- store_nbr: identifies the store at which the products are sold.
- **family**: identifies the type of product sold.
- sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
- Store metadata, including **city, state, type, and cluster**. cluster is a grouping of similar stores.
- Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
- dcoilwtico: Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)
**Note:** *There is a transaction column in the training dataset which displays the sales transactions on that particular date.*

Test.csv
- The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
- The dates in the test data are for the 15 days after the last date in the training data.

**Note:** *Unlike the training dataset, there is no transaction column in the test dataset. Therefore, while building the model, you might exclude this column and use it only for EDA.*
submission.csv - A sample submission file in the correct format.
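A hedged R sketch of how such a merged table can be rebuilt from the original competition files (the file names follow the Kaggle competition; the join keys are assumptions based on the description above):

```r
# Minimal sketch: left-join the auxiliary tables onto the training data.
library(data.table)

train    <- fread("train.csv")
stores   <- fread("stores.csv")              # city, state, type, cluster per store_nbr
oil      <- fread("oil.csv")                 # dcoilwtico per date
holidays <- fread("holidays_events.csv")     # holiday/event metadata per date

merged <- merge(train,  stores,   by = "store_nbr", all.x = TRUE)
merged <- merge(merged, oil,      by = "date",      all.x = TRUE)
# Note: dates with more than one holiday row will duplicate sales rows here,
# so holidays may need to be deduplicated or aggregated per date first.
merged <- merge(merged, holidays, by = "date",      all.x = TRUE)

fwrite(merged, "merged_train.csv")
```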
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kurdish News Summarization Dataset (KNSD) is a newly constructed and comprehensive dataset specifically curated for the task of news summarization in the Kurdish language. The dataset includes a collection of 130,000 news articles and their corresponding headlines sourced from popular Kurdish news websites such as Ktv, NRT, RojNews, K24, KNN, Kurdsat, and more. The KNSD has been meticulously compiled to encompass a diverse range of topics, covering various domains such as politics, economics, culture, sports, and regional affairs. This ensures that the dataset provides a comprehensive representation of the news landscape in the Kurdish language.

Key Features:
- Size and Variety: The dataset comprises a substantial collection of 130,000 news articles, offering a wide range of textual content for training and evaluating news summarization models in the Kurdish language. The articles are sourced from reputable and popular Kurdish news websites, ensuring credibility and authenticity.
- Article-Headline Pairs: Each news article in the KNSD is associated with its corresponding headline, allowing researchers and developers to explore the task of generating concise and informative summaries for news content specifically in Kurdish.
- Data Quality: Great attention has been given to ensuring the quality and reliability of the dataset. The articles and headlines have undergone careful curation and preprocessing to remove duplicates, ensure linguistic consistency, and filter out irrelevant or spam-like content. This guarantees that the dataset is of high quality and suitable for training robust and accurate news summarization models.
- Language and Cultural Context: The KNSD is specifically tailored for the Kurdish language, taking into account the unique linguistic characteristics and cultural context of the Kurdish-speaking population. This allows researchers to develop models that are attuned to the nuances and specificities of Kurdish news content.

Applications: The KNSD can be utilized in various applications and research areas, including but not limited to:
- News Summarization: The dataset provides a valuable resource for developing and evaluating news summarization models specifically for the Kurdish language. Researchers can explore different techniques, such as extractive or abstractive summarization, to generate concise and coherent summaries of Kurdish news articles.
- Machine Learning and Natural Language Processing (NLP): The KNSD can be used to train and evaluate machine learning models, deep learning architectures, and NLP algorithms for tasks related to news summarization, text generation, and semantic understanding in the Kurdish language.

The Kurdish News Summarization Dataset (KNSD) offers an extensive and diverse collection of news articles and headlines in the Kurdish language, providing researchers with a valuable resource for advancing the field of news summarization specifically for Kurdish-speaking audiences.
https://www.datainsightsmarket.com/privacy-policy
The global paper duplicate checking service market is experiencing robust growth, driven by the increasing academic and professional emphasis on originality and the rising number of research publications and academic assignments. The market, estimated at $2 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $6 billion by 2033. This expansion is fueled by several key factors. The burgeoning adoption of online plagiarism detection tools by educational institutions and corporations forms a significant driver. Additionally, the rising availability of sophisticated algorithms capable of detecting even subtle forms of plagiarism, including paraphrasing and other sophisticated techniques, significantly contributes to market growth. Furthermore, the increasing accessibility of these services through various subscription models (Pay-Per-Use and Charge-By-Word) broadens the market's reach across different user segments, from individual students to large-scale research organizations. The market is segmented by application (School, Personal) and type of service (Pay-Per-Use, Charge By Word), allowing for tailored solutions based on user needs and budgets. Geographic distribution shows strong growth across North America and Asia-Pacific, driven by high internet penetration and robust educational sectors. However, regulatory challenges in certain regions and the potential for false positives in plagiarism detection represent key restraints. While the market enjoys considerable growth potential, challenges exist. The increasing sophistication of plagiarism techniques necessitates continuous advancements in detection algorithms. Competition among established players like EasyBib, Paper Rater, and CNKI is intensifying, requiring providers to innovate and offer competitive pricing and enhanced features to maintain their market share. Furthermore, ensuring accuracy and minimizing false positives is crucial for maintaining user trust and ensuring the ethical application of these services. Future growth will likely be driven by the incorporation of Artificial Intelligence (AI) and machine learning to enhance accuracy and efficiency, as well as by the development of services that integrate with existing academic and professional workflows. This growth trajectory highlights a significant opportunity for companies offering innovative and reliable plagiarism detection solutions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Today (2024-06-27), we discovered an issue with the labeling of sample groups in one of the supplementary figures (Supplementary Figure 14c) in our published article. We have corrected the figure and present it here, and we extend our apologies to all readers for any confusion this may have caused (although no reports of the issue were received).
The source data table for Supplementary Figure 13 in the accompanying article was found to have issues, which were traced to an improper Excel operation. Here, we have uploaded the corrected data table.
The dataset contains features from 386 TCGA tumors for modeling ecDNA cargo gene prediction. It was converted from R data format with the following code. NOTE: columns 'sample' and 'gene_id' are not used for actual modeling but for identification and sampling purposes.
library(data.table)
# Load the R data object and rename the third column for clarity
data = readRDS("~/../Downloads/ecDNA_cargo_gene_modeling_data.rds")
colnames(data)[3] = "total_cn"
# Write the table to a compressed CSV for use outside R
data.table::fwrite(data, file = "~/../Downloads/ecDNA_cargo_gene_modeling_data.csv.gz", sep = ",")
GCAP analysis results for PCAWG allele-specific copy number profiles derived from WGS.
GCAP analysis results for TCGA allele-specific copy number profiles derived from SNP6 array.
GCAP analysis results for SYSUCC Changkang allele-specific copy number profiles derived from tumor-normal paired WES.
These datasets contain TCGA gene-level copy number results in R data format from overlapping samples (dataset above). WGS from PCAWG, SNP array, and WES from GDC portal.
GCAP results of cell line batch 1 and batch 2.
AA software results for cell line batch 1.
AA software results for cell line batch 2.
Extended raw FISH images from 12 CRC samples.
Extended AA and GCAP analysis on SNU216.
Extended AA running files (all results) and result summary data for 6 GCAP predicted ERBB2 amp clinical samples.
Source data of Figure 4.
Source data of Supplementary Figure 2 subplots.
Source data of Supplementary Figure 15.
GCAP result data objects for three ICB cohorts. Both gene-level and sample-level data included.
PDX-P68: processed (AA and CNV) data of P68 from WGS and WES data.
Source data of Supplementary Figure 13.
Updated Supplementary Figure 14.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
The dataset is provided as two files identifying GitHub repositories using the login-name/project-name convention. The file deduplicate_names contains 10,649,348 tab-separated records mapping a duplicated source project to a definitive target project.
The file forks_clones_noise_names is a 50,324,363 member superset of the source projects, containing also projects that were excluded from the mapping as noise.
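A minimal R sketch for working with the mapping file (whether the file carries a header row is an assumption; the example repository name is hypothetical):

```r
# Minimal sketch: load the tab-separated source-to-parent mapping and resolve a project.
library(data.table)

dedup_map <- fread("deduplicate_names", sep = "\t", header = FALSE,
                   col.names = c("source_project", "target_project"))

# Look up the definitive (ultimate parent) project for a hypothetical repository
dedup_map[source_project == "someuser/someproject", target_project]
```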
https://creativecommons.org/publicdomain/zero/1.0/
There are lots of datasets available for different machine learning tasks like NLP, computer vision, etc. However, I couldn't find any dataset which catered to the domain of software testing. This is one area which has lots of potential for the application of machine learning techniques, especially deep learning.
This was the reason I wanted such a dataset to exist. So, I made one.
New version [28th Nov '20] - Uploaded testing-related questions and related details from Stack Overflow. These are query results which were collected from Stack Overflow by using Stack Overflow's query viewer. The result set of this query contained posts which had the words "testing web pages".
New version [27th Nov '20] - Created a CSV file containing pairs of test case titles and test case descriptions.
This dataset is very tiny (approximately 200 rows of data). I have collected sample test cases from around the web and created a text file which contains all the test cases that I have collected. This text file has sections and under each section there are numbered rows of test cases.
I would like to thank websites like guru99.com, softwaretestinghelp.com and many other such websites which host great many sample test cases. These were the source for the test cases in this dataset.
My inspiration to create this dataset was the scarcity of examples showcasing the implementation of machine learning in the domain of software testing. I would like to see if this dataset can be used to answer questions similar to the following (a toy similarity sketch follows the list):
* Finding semantic similarity between different test cases ranging across products and applications.
* Automating the elimination of duplicate test cases in a test case repository.
* Can a recommendation system be built for suggesting domain-specific test cases to software testers?
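As a toy, hedged sketch of the duplicate-elimination idea (the input file and column name are assumptions, and token-level Jaccard similarity is only a crude stand-in for semantic similarity):

```r
# Minimal sketch: flag pairs of test case titles that look like near-duplicates.
library(data.table)

cases  <- fread("test_cases.csv")            # assumed column: title
tokens <- lapply(tolower(cases$title), function(t) unique(strsplit(t, "\\s+")[[1]]))

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

n <- length(tokens)
for (i in seq_len(n - 1)) {
  for (j in seq(i + 1, n)) {
    if (jaccard(tokens[[i]], tokens[[j]]) > 0.8) {
      cat("Possible duplicates:", cases$title[i], "<->", cases$title[j], "\n")
    }
  }
}
```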
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset has two fields, query id and query text. Use it as a validation set for your machine learning model, or as a local dataset for evaluating your passage ranker.
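A minimal R sketch for loading such a two-field query file (the file name and tab-separated layout are assumptions):

```r
library(data.table)

queries <- fread("queries.dev.tsv", sep = "\t", header = FALSE,
                 col.names = c("query_id", "query_text"))
head(queries)
```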
If you use the MS MARCO dataset, or any dataset derived from it, please cite the paper:
@article{bajaj2016ms, title={{MS MARCO}: A human generated machine reading comprehension dataset}, author={Bajaj, Payal and Campos, Daniel and Craswell, Nick and Deng, Li and Gao, Jianfeng and Liu, Xiaodong and Majumder, Rangan and McNamara, Andrew and Mitra, Bhaskar and Nguyen, Tri and others}, journal={arXiv preprint arXiv:1611.09268}, year={2016} }
https://www.marketresearchforecast.com/privacy-policy
Market Overview: The global Data Copy Management Software market is projected to reach [Market Size] by 2033, exhibiting a CAGR of XX% during the forecast period (2023-2033). The increasing demand for data protection and compliance drives this growth. Key segments of the market include cloud platform and on-premise deployment types, as well as applications in various industries such as banking, enterprise, government, and healthcare.

Key Drivers and Trends: The surge in data generation and regulatory mandates for data retention and compliance are the primary drivers of market growth. Additionally, the rising adoption of cloud-based storage and the need for efficient data management contribute to market expansion. Other trends shaping the market include the integration of machine learning and AI for data optimization and the emergence of new vendors offering specialized data copy management solutions. Although cost concerns and vendor lock-in remain as restraints, the overall market outlook is positive, driven by the increasing importance of data protection and compliance in today's digital landscape.

Data Copy Management Software empowers businesses to efficiently manage and duplicate their data across diverse locations, promoting data security, compliance, and recovery. The global Data Copy Management Software market size is projected to reach USD 12.5 billion by 2027, showcasing a robust CAGR of 10.3% from 2022 to 2027.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Label quantities of the non-duplicate and duplicate entries compared to the original dataset.