100+ datasets found
  1. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Iceland, Equatorial Guinea, Russian Federation, Saudi Arabia, Qatar, Antigua and Barbuda, Belize, Benin, Djibouti, Colombia
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  2. Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event...

    • datarade.ai
    .csv
    Cite
    Factori, Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event per Day [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-web-data-machine-learning-d-factori
    Explore at:
    Available download formats: .csv
    Dataset authored and provided by
    Factori
    Area covered
    Austria, Egypt, Faroe Islands, Taiwan, Palestine, Uzbekistan, Cameroon, Japan, Turks and Caicos Islands, Sweden
    Description

    Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.

    Integrate the comprehensive AI & ML training data provided by Grepsr and develop a superior AI & ML model.

    Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.

    Enhanced Data Quality: We have rigorous data validation processes and also conduct quality assurance checks to guarantee the integrity and reliability of the training data for you to develop the AI & ML models.

    Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.

    We offer web activity data of users that are browsing popular websites around the world. This data can be used to analyze web behavior across the web and build highly accurate audience segments based on web activity for targeting ads based on interest categories and search/browsing intent.

    Web Data Reach: Our reach data represents the total number of data counts available within various categories and comprises attributes such as Country, Anonymous ID, IP addresses, Search Query, and so on.

    Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).

    Data Attributes: Anonymous_id, IDType, Timestamp, Estid, Ip, userAgent, browserFamily, deviceType, Os, Url_metadata_canonical_url, Url_metadata_raw_query_params, refDomain, mappedEvent, Channel, searchQuery, Ttd_id, Adnxs_id, Keywords, Categories, Entities, Concepts

  3. Data Collection Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Dec 2, 2024
    Cite
    Dataintelo (2024). Data Collection Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-collection-software-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Dec 2, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Collection Software Market Outlook



    The global data collection software market size is anticipated to significantly expand from USD 1.8 billion in 2023 to USD 4.2 billion by 2032, exhibiting a CAGR of 10.1% during the forecast period. This remarkable growth is fueled by the increasing demand for data-driven decision-making solutions across various industries. As organizations continue to recognize the strategic value of harnessing vast amounts of data, the need for sophisticated data collection tools becomes more pressing. The growing integration of artificial intelligence and machine learning within software solutions is also a critical factor propelling the market forward, enabling more accurate and real-time data insights.



    One major growth factor for the data collection software market is the rising importance of real-time analytics. In an era where time-sensitive decisions can define business success, the capability to gather and analyze data in real-time is invaluable. This trend is particularly evident in sectors like healthcare, where prompt data collection can impact patient care, and in retail, where immediate insights into consumer behavior can enhance customer experience and drive sales. Additionally, the proliferation of the Internet of Things (IoT) has further accelerated the demand for data collection software, as connected devices produce a continuous stream of data that organizations must manage efficiently.



    The digital transformation sweeping across industries is another crucial driver of market growth. As businesses endeavor to modernize their operations and customer interactions, there is a heightened demand for robust data collection solutions that can seamlessly integrate with existing systems and infrastructure. Companies are increasingly investing in cloud-based data collection software to improve scalability, flexibility, and accessibility. This shift towards cloud solutions is not only enabling organizations to reduce IT costs but also to enhance collaboration by making data more readily available across different departments and geographies.



    The intensified focus on regulatory compliance and data protection is also shaping the data collection software market. With the introduction of stringent data privacy regulations such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States, organizations are compelled to adopt data collection practices that ensure compliance and protect customer information. This necessitates the use of sophisticated software capable of managing data responsibly and transparently, thereby fueling market growth. Moreover, the increasing awareness among businesses about the potential financial and reputational risks associated with data breaches is prompting the adoption of secure data collection solutions.



    Component Analysis



    The data collection software market can be segmented into software and services, each playing a pivotal role in the ecosystem. The software component remains the bedrock of this market, providing the essential tools and platforms that enable organizations to collect, store, and analyze data effectively. The software solutions offered vary in complexity and functionality, catering to different organizational needs ranging from basic data entry applications to advanced analytics platforms that incorporate AI and machine learning capabilities. The demand for such sophisticated solutions is on the rise as organizations seek to harness data not just for operational purposes but for strategic insights as well.



    The services segment encompasses various offerings that support the deployment and optimization of data collection software. These services include consulting, implementation, training, and maintenance, all crucial for ensuring that the software operates efficiently and meets the evolving needs of the user. As the market evolves, there is an increasing emphasis on offering customized services that address specific industry requirements, thereby enhancing the overall value proposition for clients. The services segment is expected to grow steadily as businesses continue to seek external expertise to complement their internal capabilities, particularly in areas such as data analytics and cybersecurity.



    Integration services have become particularly important as organizations strive to create seamless workflows that incorporate new data collection solutions with existing IT infrastructure. This need for integration is driven by the growing complexity of enterprise IT environments, where disparate systems and applications must work together.

  4. The Test-Case Dataset

    • kaggle.com
    Updated Nov 29, 2020
    Cite
    sapal6 (2020). The Test-Case Dataset [Dataset]. https://www.kaggle.com/datasets/sapal6/the-testcase-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 29, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sapal6
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    There are lots of datasets available for different machine learning tasks like NLP, computer vision, etc. However, I couldn't find any dataset that catered to the domain of software testing. This is one area with lots of potential for applying machine learning techniques, especially deep learning.

    This was the reason I wanted such a dataset to exist. So, I made one.

    Content

    New version [28th Nov '20] - Uploaded testing-related questions and related details from Stack Overflow. These are query results collected from Stack Overflow using its query viewer. The result set of this query contained posts that include the words "testing web pages".

    New version [27th Nov '20] - Created a CSV file containing pairs of test case titles and test case descriptions.

    This dataset is very tiny (approximately 200 rows of data). I have collected sample test cases from around the web and created a text file containing all of them. The text file has sections, and under each section there are numbered rows of test cases.

    Acknowledgements

    I would like to thank websites like guru99.com, softwaretestinghelp.com and many other such websites that host a great many sample test cases. These were the source for the test cases in this dataset.

    Inspiration

    My inspiration to create this dataset was the scarcity of examples showcasing the application of machine learning to the domain of software testing. I would like to see whether this dataset can be used to answer questions such as the following:

    • Finding semantic similarity between different test cases ranging across products and applications.
    • Automating the elimination of duplicate test cases in a test case repository.
    • Can a recommendation system be built for suggesting domain-specific test cases to software testers?
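
    As a starting point for the first question above, a minimal sketch of ranking similar test cases with TF-IDF and cosine similarity is shown below; the file and column names ("test_cases.csv", "title", "description") are placeholders, not the dataset's actual schema.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    df = pd.read_csv("test_cases.csv")                   # placeholder file name
    texts = df["title"].fillna("") + " " + df["description"].fillna("")

    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(texts)

    # Similarity of the first test case to every other one, top five matches.
    sims = cosine_similarity(matrix[0], matrix).ravel()
    top = sims.argsort()[::-1][1:6]
    print(df.loc[top, "title"])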

  5. Machine learning for corrosion database

    • data.mendeley.com
    Updated Oct 26, 2021
    Cite
    Leonardo Bertolucci Coelho (2021). Machine learning for corrosion database [Dataset]. http://doi.org/10.17632/jfn8yhrphd.1
    Explore at:
    Dataset updated
    Oct 26, 2021
    Authors
    Leonardo Bertolucci Coelho
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database was firstly created for the scientific article entitled: "Reviewing Machine Learning of corrosion prediction: a data-oriented perspective"

    L.B. Coelho¹, D. Zhang², Y.V. Ingelgem¹, D. Steckelmacher³, A. Nowé³, H.A. Terryn¹

    ¹ Department of Materials and Chemistry, Research Group Electrochemical and Surface Engineering, Vrije Universiteit Brussel, Brussels, Belgium
    ² Beijing Advanced Innovation Center for Materials Genome Engineering, National Materials Corrosion and Protection Data Center, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, China
    ³ VUB Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium

    Different metrics can be used to evaluate the prediction accuracy of regression models; however, only papers providing relative metrics (MAPE, R²) were included in this database. We tried as much as possible to include descriptors of all major ML procedure steps, including data collection (“Data acquisition”), data cleaning, feature engineering (“Feature reduction”), model validation (“Train-Test split”*), etc.

    *the total dataset is typically split into training sets and testing (unknown data) sets for performance evaluation of the model. Nonetheless, sometimes only the training or the testing performances were reported (“?” marks were added in the respective evaluation metric field(s)). The “Average R²” was sometimes considered for studies employing “CV” (cross-validation) on the dataset. For a detailed description of the ML basic procedures, the reader could refer to the References topic in the Review article.
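
    For reference, the two relative metrics recorded in this database are defined as follows; this is a generic sketch, not code shipped with the dataset.

    import numpy as np

    def mape(y_true, y_pred):
        """Mean absolute percentage error, in percent."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

    def r2(y_true, y_pred):
        """Coefficient of determination R^2."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    print(mape([1.0, 2.0, 4.0], [1.1, 1.9, 4.2]))   # ~6.7 %
    print(r2([1.0, 2.0, 4.0], [1.1, 1.9, 4.2]))     # ~0.99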

  6. Ultimate Data Science Book Collection

    • kaggle.com
    Updated Feb 15, 2023
    Cite
    Mayuri Awati (2023). Ultimate Data Science Book Collection [Dataset]. https://www.kaggle.com/datasets/mayuriawati/ultimate-data-science-book-collection/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mayuri Awati
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The data set that I have compiled is based on a collection of books related to various topics in data science. I was inspired to create this data set because I wanted to gain insights into the popularity of different data science topics, as well as the most common words used in the titles or descriptions, and the most common authors or publishers in these areas.

    To collect the data set, I used the Google Books API, which allowed me to search for and retrieve information about books related to specific topics. I focused on topics such as Python for data science, R, SQL, statistics, machine learning, NLP, deep learning, data visualization, and data ethics, as I wanted to create a diverse and comprehensive data set that covered a wide range of data science subjects.
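
    For illustration, topic queries of this kind can be made against the public Google Books volumes endpoint; the exact queries and fields used to build this dataset are not documented here, so the sketch below is only an approximation.

    import requests

    def search_books(topic, max_results=40):
        resp = requests.get(
            "https://www.googleapis.com/books/v1/volumes",
            params={"q": topic, "maxResults": max_results},
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            info = item.get("volumeInfo", {})
            yield {
                "title": info.get("title"),
                "authors": ", ".join(info.get("authors", [])),
                "publisher": info.get("publisher"),
                "publishedDate": info.get("publishedDate"),
            }

    rows = list(search_books("python for data science"))
    print(len(rows), rows[:2])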

    The books included in the data set were written by various authors and published by different publishing houses, and I included books that were published within the past 10 years. I believe that this data set will be useful for anyone who is interested in data science, whether they are a beginner or an experienced practitioner. It can be used to build recommendation systems for books based on user interests, to identify gaps in the existing literature on a specific topic, or for general data analysis purposes.

    I hope that this data set will be a valuable resource for the data science community and will contribute to the advancement of the field.

  7. Processed Lab Data for Neural Network-Based Shear Stress Level Prediction

    • gdr.openei.org
    • data.openei.org
    • +3more
    data
    Updated May 14, 2021
    + more versions
    Cite
    Chris Marone; Derek Elsworth; Jing Yang (2021). Processed Lab Data for Neural Network-Based Shear Stress Level Prediction [Dataset]. http://doi.org/10.15121/1787545
    Explore at:
    Available download formats: data
    Dataset updated
    May 14, 2021
    Dataset provided by
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Renewable Power Office. Geothermal Technologies Program (EE-4G)
    Pennsylvania State University
    Geothermal Data Repository
    Authors
    Chris Marone; Derek Elsworth; Jing Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning can be used to predict fault properties such as shear stress, friction, and time to failure using continuous records of fault zone acoustic emissions. The files are extracted features and labels from lab data (experiment p4679). The features are extracted with a non-overlapping window from the original acoustic data. The first column is the time of the window. The second and third columns are the mean and the variance of the acoustic data in this window, respectively. The 4th-11th columns are the power spectral density ranging from low to high frequency, and the last column is the corresponding label (shear stress level). The name of each file indicates which driving velocity the sequence was generated from.

    Data were generated from laboratory friction experiments conducted with a biaxial shear apparatus. Experiments were conducted in the double direct shear configuration, in which two fault zones are sheared between three rigid forcing blocks. Our samples consisted of two 5-mm-thick layers of simulated fault gouge with a nominal contact area of 10 by 10 cm^2. Gouge material consisted of soda-lime glass beads with initial particle size between 105 and 149 micrometers. Prior to shearing, we impose a constant fault normal stress of 2 MPa using a servo-controlled load-feedback mechanism and allow the sample to compact. Once the sample has reached a constant layer thickness, the central block is driven down at a constant rate of 10 micrometers per second. In tandem, we collect an AE signal continuously at 4 MHz from a piezoceramic sensor embedded in a steel forcing block about 22 mm from the gouge layer. The data from this experiment can be used with the deep learning algorithm to train it for future fault property prediction.
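
    A minimal sketch of reading one of these feature files, assuming plain-text numeric columns in the order described above (time, mean, variance, eight power-spectral-density bands, shear stress label); the file name below is hypothetical.

    import numpy as np

    cols = (["time", "mean", "variance"]
            + [f"psd_band_{i}" for i in range(1, 9)]
            + ["shear_stress_level"])

    data = np.loadtxt("p4679_features_v10um_s.txt")   # hypothetical file name
    assert data.shape[1] == len(cols)                 # 12 columns expected

    features = data[:, 1:-1]                          # mean, variance, PSD bands
    labels = data[:, -1]                              # shear stress level
    print(features.shape, labels.shape)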

  8. Optimized parameter values for play detection.

    • plos.figshare.com
    xls
    Updated Apr 18, 2024
    + more versions
    Cite
    Jonas Bischofberger; Arnold Baca; Erich Schikuta (2024). Optimized parameter values for play detection. [Dataset]. http://doi.org/10.1371/journal.pone.0298107.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Jonas Bischofberger; Arnold Baca; Erich Schikuta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tacklings comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy with F-scores of 0.71 for passes and 0.65 for shots if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is currently overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers. We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
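
    As a reminder of the metric quoted above, the F-score is the harmonic mean of precision and recall over detected versus annotated events; a toy computation with scikit-learn:

    from sklearn.metrics import f1_score

    y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # toy ground-truth event labels (1 = pass)
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy detector output
    print(f1_score(y_true, y_pred))     # 0.75 for this example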

  9. Behaviour Biometrics Dataset

    • data.mendeley.com
    Updated Jun 20, 2022
    Cite
    Nonso Nnamoko (2022). Behaviour Biometrics Dataset [Dataset]. http://doi.org/10.17632/fnf8b85kr6.1
    Explore at:
    Dataset updated
    Jun 20, 2022
    Authors
    Nonso Nnamoko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset provides a collection of behaviour biometrics data (commonly known as Keyboard, Mouse and Touchscreen (KMT) dynamics). The data was collected for use in a FinTech research project undertaken by academics and researchers at Computer Science Department, Edge Hill University, United Kingdom. The project called CyberSIgnature uses KMT dynamics data to distinguish between legitimate card owners and fraudsters. An application was developed that has a graphical user interface (GUI) similar to a standard online card payment form including fields for card type, name, card number, card verification code (cvc) and expiry date. Then, user KMT dynamics were captured while they entered fictitious card information on the GUI application.

    The dataset consists of 1,760 KMT dynamic instances collected over 88 user sessions on the GUI application. Each user session involves 20 iterations of data entry in which the user is assigned a fictitious card information (drawn at random from a pool) to enter 10 times and subsequently presented with 10 additional card information, each to be entered once. The 10 additional card information is drawn from a pool that has been assigned or to be assigned to other users. A KMT data instance is collected during each data entry iteration. Thus, a total of 20 KMT data instances (i.e., 10 legitimate and 10 illegitimate) was collected during each user entry session on the GUI application.

    The raw dataset is stored in .json format within 88 separate files. The root folder, named 'behaviour_biometrics_dataset', consists of two sub-folders, 'raw_kmt_dataset' and 'feature_kmt_dataset', and a Jupyter notebook file (kmt_feature_classification.ipynb). The folder and file contents are described below:

    -- 'raw_kmt_dataset': this folder contains 88 files, each named 'raw_kmt_user_n.json', where n is a number from 0001 to 0088. Each file contains 20 instances of KMT dynamics data corresponding to a given fictitious card; the data instances are equally split between legitimate (n = 10) and illegitimate (n = 10) classes. The legitimate class corresponds to KMT dynamics captured from the user assigned to the card detail, while the illegitimate class corresponds to KMT dynamics data collected from other users entering the same card detail.

    -- 'feature_kmt_dataset': this folder contains two sub-folders, namely 'feature_kmt_json' and 'feature_kmt_xlsx'. Each folder contains 88 files (of the relevant format: .json or .xlsx), each named 'feature_kmt_user_n', where n is a number from 0001 to 0088. Each file contains 20 instances of features extracted from the corresponding 'raw_kmt_user_n' file, including the class labels (legitimate = 1 or illegitimate = 0).

    -- 'kmt_feature_classification.ipynb': this file contains the Python code necessary to generate features from the raw KMT files and apply a simple machine learning classification task to generate results. The code is designed to run with minimal effort from the user.
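
    A minimal sketch of working with one user's files, following the folder layout above; the internal JSON structure and the label column name are not documented here, so the tabular read below is an assumption.

    import json
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    root = "behaviour_biometrics_dataset"
    with open(f"{root}/raw_kmt_dataset/raw_kmt_user_0001.json") as fh:
        raw = json.load(fh)                  # raw KMT dynamics for user 0001

    feat = pd.read_json(f"{root}/feature_kmt_dataset/feature_kmt_json/feature_kmt_user_0001.json")
    X = feat.drop(columns=["label"])         # "label" column name is an assumption
    y = feat["label"]                        # 1 = legitimate, 0 = illegitimate
    print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())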

  10. PSYCHE-D: predicting change in depression severity using person-generated...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 18, 2024
    Cite
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
    Explore at:
    Available download formats: pdf, bin
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay
    Description

    This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

    Dataset description

    Parquet file, with:

    • 35694 rows
    • 154 columns

    The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
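
    A minimal sketch of loading the feature matrix and splitting that index; the Parquet file name is a placeholder.

    import pandas as pd

    df = pd.read_parquet("psyche_d_features.parquet")   # placeholder file name
    print(df.shape)                                     # expected: (35694, 154)

    idx = df.index.to_series().str.split("_", expand=True)
    df["participant"] = idx[0].astype(int)
    df["month"] = idx[1].astype(int)
    print(df[["participant", "month"]].head())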

    Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

    The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

    The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

    The data subset used in this work comprises the following:

    • Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
    • Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
    • Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
    • Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

    From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

    The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.

  11. Sentinel-2 machine learning dataset for tree species classification in...

    • openagrar.de
    Updated Mar 5, 2024
    Cite
    Maximilian Freudenberg; Sebastian Schnell; Paul Magdon (2024). Sentinel-2 machine learning dataset for tree species classification in Germany [Dataset]. http://doi.org/10.3220/DATA20240402122351-0
    Explore at:
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    University of Göttingen
    Thünen Institute of Forest Ecosystems
    University of Applied Sciences and Arts - HAWK, Göttingen
    Authors
    Maximilian Freudenberg; Sebastian Schnell; Paul Magdon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    The dataset contains time series of bottom of atmosphere (BOA) reflectance from the Sentinel-2 satellite mission for tree species classification in a machine learning context. BOA reflectance was computed with the FORCE processing engine (https://force-eo.readthedocs.io/en/latest/index.html) and the corresponding data cube is available at the CODE-DE (https://code-de.org/de/) or EO Lab (https://eo-lab.org/de/) platform. Alternatively, the BOA reflectance can be calculated using the provided FORCE parameter files (*.prm), guaranteeing that BOA values match the ones from the dataset.

    The time series were extracted from the FORCE data cube for individual tree positions as they are collected in the field by the German national forest inventory (NFI). A detailed description of NFI methodology is available here: https://bwi.info/Download/de/Methodik/. The timespan for the satellite observations is from July 2015 to October 2022, and BOA reflectance is labelled with tree species, diameter of the stem measured at a height of 1.3 m, height of the tree, area of the crown as projected to the ground, and additional variables. The dataset contains about 83 million data points from about 360,000 trees covering all environmental conditions in Germany. As reference for geolocation, the centre of the 1 km INSPIRE grid cell closest to the corresponding NFI sampling unit was used. The exact locations of the sampling units and individual tree positions are confidential.

    A short introduction to data access and analysis is provided in the Jupyter notebook (intro_to_dataset.ipynb) using Python. A description of the variables is provided below (Methodology) and in the database (table meta_col), along with a code table for the tree species (x_species). For a more detailed description of the dataset, the applied methodology, and a discussion of error sources, please refer to the linked data publication paper. EPSG: 4326
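
    The container format of the database is described in the linked data publication; assuming it is delivered as a single SQLite file, the metadata tables mentioned above could be inspected along the following lines (file name hypothetical).

    import sqlite3
    import pandas as pd

    con = sqlite3.connect("sentinel2_tree_species.db")      # hypothetical file name
    meta = pd.read_sql("SELECT * FROM meta_col", con)       # variable descriptions
    species = pd.read_sql("SELECT * FROM x_species", con)   # tree species code table
    print(meta.head())
    print(species.head())
    con.close()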

  12. The manifest and store data of 870,515 Android mobile applications - Dataset...

    • b2find.eudat.eu
    Updated Oct 23, 2023
    + more versions
    Cite
    (2023). The manifest and store data of 870,515 Android mobile applications - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b25ee20e-5268-50ae-9914-4bc70bd4ff1c
    Explore at:
    Dataset updated
    Oct 23, 2023
    Description

    We built a crawler to collect data from the Google Play store, including the applications' metadata and APK files. The manifest files were extracted from the APK files and then processed to extract the features. The data set is composed of 870,515 records/apps, and for each app we produced 48 features. The data set was used to build and test two bootstrap-aggregated ensembles of multiple XGBoost machine learning classifiers. The dataset was collected between April 2017 and November 2018. We then checked the status of these applications on three different occasions: December 2018, February 2019, and May-June 2019.
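
    An illustrative sketch of bootstrap aggregating over XGBoost base learners (not the authors' exact pipeline); the feature file and the "status" label column are placeholders.

    import pandas as pd
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    df = pd.read_csv("android_apps_features.csv")        # placeholder file name
    X, y = df.drop(columns=["status"]), df["status"]     # placeholder label column

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = BaggingClassifier(
        estimator=XGBClassifier(n_estimators=200, max_depth=6),
        n_estimators=10,      # 10 bootstrap replicates of the XGBoost base learner
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))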

  13. Physiological signals during activities for daily life: Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Mar 29, 2022
    Cite
    Eduardo Gutierrez Maestro (2022). Physiological signals during activities for daily life: Dataset [Dataset]. http://doi.org/10.5281/zenodo.6391454
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Gutierrez Maestro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this work is composed of four participants, two men and two women. Each of them carried the wearable device Empatica E4 for a total of 15 days. They carried the wearable during the day, and during the nights we asked participants to charge the device and load the data into an external memory unit. During these days, participants were asked to answer EMA questionnaires, which are used to label our data. However, some participants could not complete the full experiment, or some days were discarded due to data corruption. Specific demographic information, total sampling days and total number of EMA answers can be found in Table I.

                         Participant 1   Participant 2   Participant 3   Participant 4
    Age                  67              55              60              63
    Gender               Male            Female          Male            Female
    Final Valid Days     9               15              12              13
    Total EMAs           42              57              64              46

    Table I. Summary of participants' collected data.

    This dataset provides three different types of labels. Activeness and happiness are two of these labels; they are the answers to EMA questionnaires that participants reported during their daily activities, given as numbers between 0 and 4.
    These labels are used to interpolate the mental well-being state according to [1]. We report in our dataset a total number of eight emotional states: (1) pleasure, (2) excitement, (3) arousal, (4) distress, (5) misery, (6) depression, (7) sleepiness, and (8) contentment.

    The data we provide in this repository consist of two type of files:

    • CSV files: These files contain physiological signals recorded during the data collection process. The first line of each CSV file defines the timestamp by which data started being sampled. The second line defines the sampling frequency used for gathering the signal. From the third line until the end of the file, one can find sampled datapoints.
    • Excel files: These files contain the labels obtained from EMA answers. It is indicated the timestamp at which the answer was registered. Labels for pleasure, activeness and mood can be found in this file.

    NOTE: Files are numbered according to each specific sampling day. For example, ACC1.csv corresponds to the ACC signal for sampling day 1. The same applies to the Excel files.

    Code and a tutorial on how to label the data and extract features can be found in this repository: https://github.com/edugm94/temporal-feat-emotion-prediction
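
    The repository above provides the full labelling and feature-extraction pipeline; as a quick orientation, here is a minimal sketch of parsing one signal file with the layout described earlier (line 1 = start timestamp, line 2 = sampling frequency, remaining lines = samples).

    import numpy as np
    import pandas as pd

    def load_e4_csv(path):
        with open(path) as fh:
            start_ts = float(fh.readline().split(",")[0])   # Unix start time
            fs = float(fh.readline().split(",")[0])         # sampling frequency in Hz
            samples = np.loadtxt(fh, delimiter=",")
        t = start_ts + np.arange(len(samples)) / fs         # per-sample timestamps
        return pd.DataFrame(samples), t, fs

    acc, timestamps, fs = load_e4_csv("ACC1.csv")           # accelerometer, sampling day 1
    print(fs, acc.shape)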

    References:

    [1] J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.

  14. EEG-BCI Dataset for "Continuous Tracking using Deep Learning-based Decoding...

    • kilthub.cmu.edu
    zip
    Updated May 6, 2024
    Cite
    Dylan Forenzo; Bin He (2024). EEG-BCI Dataset for "Continuous Tracking using Deep Learning-based Decoding for Non-invasive Brain-Computer Interface" [Dataset]. http://doi.org/10.1184/R1/25360300.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 6, 2024
    Dataset provided by
    Carnegie Mellon University
    Authors
    Dylan Forenzo; Bin He
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This EEG Brain Computer Interface (BCI) dataset was collected as part of the study titled “Continuous Tracking using Deep Learning-based Decoding for Non-invasive Brain-Computer Interface”.

    If you use a part of this dataset in your work, please cite the following publication: D. Forenzo, H. Zhu, J. Shanahan, J. Lim, and B. He, “Continuous tracking using deep learning-based decoding for noninvasive brain–computer interface,” PNAS Nexus, vol. 3, no. 4, p. pgae145, Apr. 2024, doi: 10.1093/pnasnexus/pgae145.

    This dataset was collected under support from the National Institutes of Health via grants AT009263, NS096761, NS127849, EB029354, NS124564, and NS131069 to Dr. Bin He.

    Correspondence about the dataset: Dr. Bin He, Carnegie Mellon University, Department of Biomedical Engineering, Pittsburgh, PA 15213. E-mail: bhe1@andrew.cmu.edu

  15. Machine Learning-Enhanced Multiphase CFD for Carbon Capture Modeling Run...

    • osti.gov
    Updated Jun 28, 2024
    Cite
    USDOE Office of Fossil Energy (FE) (2024). Machine Learning-Enhanced Multiphase CFD for Carbon Capture Modeling Run Data [Dataset]. http://doi.org/10.18141/2344941
    Explore at:
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    National Energy Technology Laboratory (https://netl.doe.gov/)
    USDOE Office of Fossil Energy (FE)
    Description

    Repository for the data generated as part of the 2023-2024 ALCC project "Machine Learning-Enhanced Multiphase CFD for Carbon Capture Modeling." The data was generated with MFIX-Exa's CFD-DEM model. The problem of interest is gravity driven, particle-laden, gas-solid flow in a triply-periodic domain of length 2048 particle diameters with an aspect ratio of 4. The mean particle concentration ranges from 1% to 40% and the Archimedes number ranges from 18 to 90. The particle-to-fluid density ratio, particle-particle restitution and friction coefficients and domain aspect ratio are held constant at values of 1000, 0.9, 0.25 and 4, respectively. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231 using NERSC award ALCC-ERCAP0025948.

  16. Global Data Collection and Labeling Market Size By Type (Text, Image/Video),...

    • verifiedmarketresearch.com
    Updated Dec 11, 2024
    Cite
    VERIFIED MARKET RESEARCH (2024). Global Data Collection and Labeling Market Size By Type (Text, Image/Video), By Application (Automotive, Healthcare), By Geographic Scope and Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/data-collection-and-labeling-market/
    Explore at:
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2026 - 2032
    Area covered
    Global
    Description

    Data Collection and Labeling Market size was valued at USD 18.18 Billion in 2024 and is projected to reach USD 93.37 Billion by 2032, growing at a CAGR of 25.03% from 2026 to 2032.

    Key Market Drivers:

    • Increasing Reliance on Artificial Intelligence and Machine Learning: As AI and machine learning become more prevalent in numerous industries, the necessity for reliable data gathering and categorization grows. By 2025, the AI business is estimated to be worth USD 126 billion, emphasizing the significance of high-quality datasets for effective modeling.
    • Increasing Emphasis on Data Privacy and Compliance: With stronger requirements such as GDPR and CCPA, enterprises must prioritize data collection methods that ensure privacy and compliance. The global data privacy industry was expected to grow to USD 6.7 billion by 2023, highlighting the need for responsible data handling in labeling processes.
    • Emergence of Advanced Data Annotation Tools: Technological improvements are driving the emergence of enhanced data annotation tools, improving efficiency and lowering costs. The global data annotation tools market is expected to grow significantly, facilitating faster and more accurate labeling of data, essential for meeting the increasing demands of AI applications.

  17. AAMOS-00 Study: Predicting Asthma Attacks Using Connected Mobile Devices and...

    • finddatagovscot.dtechtive.com
    • dtechtive.com
    • +1more
    csv, docx, xlsx
    Updated Oct 20, 2022
    Cite
    University of Edinburgh. Edinburgh Medical School. Usher Institute (2022). AAMOS-00 Study: Predicting Asthma Attacks Using Connected Mobile Devices and Machine Learning [Dataset]. http://doi.org/10.7488/ds/3775
    Explore at:
    Available download formats: xlsx(0.0235 MB), csv(0.0509 MB), csv(3.229 MB), csv(30.7 MB), csv(0.0155 MB), csv(0.0039 MB), docx(0.0229 MB), csv(0.018 MB), csv(0.1885 MB), csv(0.0652 MB), csv(0.0613 MB), csv(30.96 MB)
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    University of Edinburgh. Edinburgh Medical School. Usher Institute
    Area covered
    UNITED KINGDOM
    Description

    Monitoring asthma condition is essential to asthma self-management. However, traditional methods of monitoring require high levels of active engagement, and patients may regard this level of monitoring as tedious. Passive monitoring with mobile health devices, especially when combined with machine learning, provides an avenue to dramatically reduce management burden. However, data for developing machine learning algorithms are scarce, and gathering new data is expensive. A few asthma mHealth datasets are publicly available, but lack objective and passively collected data which may enhance asthma attack prediction systems.

    To fill this gap, we carried out the 2-phase, 7-month AAMOS-00 observational study to collect data about asthma status using three smart monitoring devices (smart peak flow meter, smart inhaler, smartwatch), and daily symptom questionnaires. Combined with localised weather, pollen, and air quality reports, we have collected a rich longitudinal dataset to explore the feasibility of passive monitoring and asthma attack prediction. Conducting phase 2 of device monitoring over 12 months, from June 2021 to June 2022 and during the COVID-19 pandemic, 22 participants across the UK provided 2,054 unique patient-days of data. This valuable anonymised dataset has been made publicly available with the consent of participants.

    Ethics approval was provided by the East of England - Cambridge Central Research Ethics Committee. IRAS project ID: 285505, with governance approval from ACCORD (Academic and Clinical Central Office for Research and Development), project number: AC20145. The study sponsor was ACCORD, the University of Edinburgh. The anonymised dataset was produced with statistical advice from Aryelly Rodriguez - Statistician, Edinburgh Clinical Trials Unit, University of Edinburgh.

    Protocol: 'Predicting asthma attacks using connected mobile devices and machine learning; the AAMOS-00 observational study protocol' - BMJ Open, DOI: 10.1136/bmjopen-2022-064166

    Thesis: Tsang, Kevin CH, 'Application of data-driven technologies for asthma self-management' (2022) [Doctoral Thesis], University of Edinburgh, https://era.ed.ac.uk/handle/1842/40547

    The dataset also relates to the publication: K.C.H. Tsang, H. Pinnock, A.M. Wilson, D. Salvi and S.A. Shah (2023). 'Home monitoring with connected mobile devices for asthma attack prediction with machine learning', Scientific Data 10 (https://doi.org/10.1038/s41597-023-02241-9).

  18. Data for: Hit2flux: A machine learning framework for boiling heat flux...

    • search.dataone.org
    Updated Jun 11, 2025
    Cite
    Christy Dunlap; Changgen Li; Hari Pandey; Han Hu (2025). Data for: Hit2flux: A machine learning framework for boiling heat flux prediction using hit-based acoustic emission sensing [Dataset]. http://doi.org/10.5061/dryad.g79cnp628
    Explore at:
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Christy Dunlap; Changgen Li; Hari Pandey; Han Hu
    Description

    This paper presents Hit2Flux, a machine learning framework for boiling heat flux prediction using acoustic emission (AE) hits generated through threshold-based transient sampling. Unlike continuously sampled data, AE hits are recorded when the signal exceeds a predefined threshold and are thus discontinuous in nature. Meanwhile, each hit represents a waveform at a high sampling frequency (~1 MHz). In order to capture the features of both the high-frequency waveforms and the temporal distribution of hits, Hit2Flux involves i) feature extraction by transforming AE hits into the frequency domain and organizing these spectra into sequences using a rolling window to form “sequences-of-sequences”, and ii) heat flux prediction using a long short-term memory (LSTM) network with sequences of sequences. The model is trained on AE hits recorded during pool boiling experiments using an AE sensor attached to the boiling chamber. Continuously sampled acoustic data using a hydrophone were also collected …

    Data for: Hit2flux: A machine learning framework for boiling heat flux prediction using hit-based acoustic emission sensing

    Dataset DOI: 10.5061/dryad.g79cnp628

    Description of the data and file structure

    This dataset includes acoustic emission hit data and waveforms, hydrophone data, pressure, and temperature data from transient pool boiling tests on copper microchannels. The pool boiling test facility includes (a) a heating element that consists of a copper block and cartridge heaters; (b) a closed chamber with flow loops for a chiller (Thermo Scientific Polar ACCEL 500 Low/EA) connecting a Graham condenser (Ace glass 5953-106) with an adapter (Ace glass 5838-76) and an in-house built coiled copper condenser; and (c) a synchronized multimodal sensing system. The copper block is submerged in deionized water and heated by nine cartridge heaters (Omega Engineering HDC19102), each with a power rating of 50 W, inserted from the bottom. The cartridge heaters …
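
    A hedged sketch of the “sequences-of-sequences” construction described in the abstract above: each AE hit waveform is transformed to a magnitude spectrum, and a rolling window groups consecutive hit spectra into one LSTM input. The window length, shapes, and synthetic hits below are illustrative assumptions, not values from the paper.

    import numpy as np

    def hits_to_sequences(hits, window=32):
        """hits: list of 1-D hit waveforms (lengths may differ)."""
        spectra = [np.abs(np.fft.rfft(h)) for h in hits]      # frequency-domain features per hit
        n_bins = min(len(s) for s in spectra)
        spectra = np.stack([s[:n_bins] for s in spectra])     # (n_hits, n_bins)
        # rolling window over consecutive hits -> (n_windows, window, n_bins)
        return np.stack([spectra[i:i + window]
                         for i in range(len(spectra) - window + 1)])

    rng = np.random.default_rng(0)
    fake_hits = [rng.standard_normal(1024) for _ in range(100)]   # stand-ins for recorded AE hits
    X = hits_to_sequences(fake_hits)
    print(X.shape)                                                # (69, 32, 513)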

  19. Replication Data for: Active Learning Approaches for Labeling Text: Review...

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    Cite
    Miller, Blake; Linder, Fridolin; Mebane, Walter (2023). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Miller, Blake; Linder, Fridolin; Mebane, Walter
    Description

    Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or `passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures and find that even with low-reliability active learning performs more efficiently than does random sampling.
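
    A generic sketch of the pool-based, uncertainty-driven loop that active learning refers to (least-confidence querying on a synthetic imbalanced task; not the paper's exact experimental setup):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=50,
                               weights=[0.9, 0.1], random_state=0)

    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(y), size=20, replace=False))   # small seed set
    pool = [i for i in range(len(y)) if i not in labeled]

    for _ in range(50):                                          # 50 query rounds
        clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        confidence = clf.predict_proba(X[pool]).max(axis=1)      # model confidence on the pool
        query = pool.pop(int(confidence.argmin()))               # least-confident example
        labeled.append(query)                                    # a human coder would label it

    print(clf.score(X[pool], y[pool]))                           # accuracy on the remaining pool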

  20. Data from: DIPSEER: A Dataset for In-Person Student Emotion and Engagement...

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Cite
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel (2025). DIPSEER: A Dataset for In-Person Student Emotion and Engagement Recognition in the Wild [Dataset]. https://observatorio-cientifico.ua.es/documentos/67321d21aea56d4af0484172
    Explore at:
    Dataset updated
    2025
    Authors
    Márquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Álvarez, Carolina Lorenzo; Fernandez-Herrero, Jorge; Viejo, Diego; Rosabel Roig-Vila; Cazorla, Miguel
    Description

    Data Description

    The DIPSER dataset is designed to assess student attention and emotion in in-person classroom settings, consisting of RGB camera data, smartwatch sensor data, and labeled attention and emotion metrics. It includes multiple camera angles per student to capture posture and facial expressions, complemented by smartwatch data for inertial and biometric metrics. Attention and emotion labels are derived from self-reports and expert evaluations. The dataset includes diverse demographic groups, with data collected in real-world classroom environments, facilitating the training of machine learning models for predicting attention and correlating it with emotional states.

    Data Collection and Generation Procedures

    The dataset was collected in a natural classroom environment at the University of Alicante, Spain. The recording setup consisted of six general cameras positioned to capture the overall classroom context and individual cameras placed at each student's desk. Additionally, smartwatches were used to collect biometric data, such as heart rate, accelerometer, and gyroscope readings.

    Experimental Sessions

    Nine distinct educational activities were designed to ensure a comprehensive range of engagement scenarios:

    1. News Reading – Students read projected or device-displayed news.
    2. Brainstorming Session – Idea generation for problem-solving.
    3. Lecture – Passive listening to an instructor-led session.
    4. Information Organization – Synthesizing information from different sources.
    5. Lecture Test – Assessment of lecture content via mobile devices.
    6. Individual Presentations – Students present their projects.
    7. Knowledge Test – Conducted using Kahoot.
    8. Robotics Experimentation – Hands-on session with robotics.
    9. MTINY Activity Design – Development of educational activities with computational thinking.

    Technical Specifications

    • RGB Cameras: Individual cameras recorded at 640×480 pixels, while context cameras captured at 1280×720 pixels.
    • Frame Rate: 9-10 FPS depending on the setup.
    • Smartwatch Sensors: Collected heart rate, accelerometer, gyroscope, rotation vector, and light sensor data at a frequency of 1–100 Hz.

    Data Organization and Formats

    The dataset follows a structured directory format:

    /groupX/experimentY/subjectZ.zip

    Each subject-specific folder contains:

    • images/ (individual facial images)
    • watch_sensors/ (sensor readings in JSON format)
    • labels/ (engagement & emotion annotations)
    • metadata/ (subject demographics & session details)

    Annotations and Labeling

    Each data entry includes engagement levels (1-5) and emotional states (9 categories) based on both self-reported labels and evaluations by four independent experts. A custom annotation tool was developed to ensure consistency across evaluations.

    Missing Data and Data Quality

    • Synchronization: A centralized server ensured time alignment across devices. Brightness changes were used to verify synchronization.
    • Completeness: No major missing data, except for occasional random frame drops due to embedded device performance.
    • Data Consistency: Uniform collection methodology across sessions, ensuring high reliability.

    Data Processing Methods

    To enhance usability, the dataset includes preprocessed bounding boxes for face, body, and hands, along with gaze estimation and head pose annotations. These were generated using YOLO, MediaPipe, and DeepFace.

    File Formats and Accessibility

    • Images: Stored in standard JPEG format.
    • Sensor Data: Provided as structured JSON files.
    • Labels: Available as CSV files with timestamps.

    The dataset is publicly available under the CC-BY license and can be accessed along with the necessary processing scripts via the DIPSER GitHub repository.

    Potential Errors and Limitations

    • Due to camera angles, some student movements may be out of frame in collaborative sessions.
    • Lighting conditions vary slightly across experiments.
    • Sensor latency variations are minimal but exist due to embedded device constraints.

    Citation

    If you find this project helpful for your research, please cite our work using the following bibtex entry:

    @misc{marquezcarpintero2025dipserdatasetinpersonstudent1,
      title={DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild},
      author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Carolina Lorenzo Álvarez and Jorge Fernandez-Herrero and Diego Viejo and Rosabel Roig-Vila and Miguel Cazorla},
      year={2025},
      eprint={2502.20209},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20209},
    }

    Usage and Reproducibility

    Researchers can utilize standard tools like OpenCV, TensorFlow, and PyTorch for analysis. The dataset supports research in machine learning, affective computing, and education analytics, offering a unique resource for engagement and attention studies in real-world classroom environments.
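
    A minimal sketch of iterating over the /groupX/experimentY/subjectZ.zip layout described above; the dataset root path is a placeholder.

    import zipfile
    from pathlib import Path

    root = Path("DIPSER")                                   # placeholder dataset root
    for subject_zip in sorted(root.glob("group*/experiment*/subject*.zip")):
        with zipfile.ZipFile(subject_zip) as zf:
            names = zf.namelist()
            images = [n for n in names if n.startswith("images/")]
            labels = [n for n in names if n.startswith("labels/")]
            print(subject_zip.name, len(images), "images,", len(labels), "label files")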
