9 datasets found
  1. MongoDB dump (compressed)

    • figshare.com
    7z
    Updated Jun 1, 2023
    Cite
    Connor Coley (2023). MongoDB dump (compressed) [Dataset]. http://doi.org/10.6084/m9.figshare.4833482.v1
    Explore at:
    Available download formats: 7z
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Connor Coley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This mongodump contains four collections associated with http://dx.doi.org/10.1021/acscentsci.7b00064:

    • reaction_examples/lowe_1976-2013_USPTOgrants: a collection of reaction SMILES extracted from USPTO grants by Daniel Lowe
    • reaction_examples/lowe_1976-2013_USPTOgrants_reactions: an incomplete collection of reactions extracted from USPTO grants by Daniel Lowe, containing some additional information about reagents/catalysts/solvents where known
    • askcos_transforms/lowe_refs_general_v3: a collection of highly general reaction SMARTS strings extracted from the USPTO
    • smilesprediction/candidate_edits_8_9_16: a collection of reaction examples with possible products enumerated, used as input for a machine learning model

  2. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Oct 20, 2022
    + more versions
    Cite
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
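    As a minimal sketch (the column names below are invented; the real LifeSnaps CSVs have their own daily/hourly Fitbit, SEMA, and survey columns):

```python
import io

import pandas as pd

# Illustrative CSV text standing in for one of the LifeSnaps files;
# the actual column names differ per file.
csv_text = "date,steps,sleep_minutes\n2021-06-01,8123,412\n2021-06-02,9544,388\n"

# pandas.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.shape)  # (2, 3)
```

    For the real files, pass the CSV path directly to pandas.read_csv().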

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend using the raw, complete data by importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema 

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
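    The commands above appear to be cut off before the final argument: mongorestore expects the path to the dumped .bson file (or dump directory) at the end. A sketch of a full invocation, assuming (hypothetically) that the dump was extracted to ./rais_anonymized/:

```shell
# Hypothetical path: adjust to wherever you extracted the LifeSnaps dump.
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit ./rais_anonymized/fitbit.bson

# With access control enabled, add credentials:
mongorestore --host localhost:27017 --username myUser --password 'myPassword' \
  -d rais_anonymized -c fitbit ./rais_anonymized/fitbit.bson
```

    Repeat with -c sema and -c surveys (and the matching .bson paths) for the other two collections.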

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  3. Transient Host Exchange

    • explore.openaire.eu
    Updated Oct 15, 2021
    Cite
    THEx Team; Yu-Jing Qin (2021). Transient Host Exchange [Dataset]. http://doi.org/10.5281/zenodo.5568961
    Explore at:
    Dataset updated
    Oct 15, 2021
    Authors
    THEx Team; Yu-Jing Qin
    Description

    The First Public Data Release (DR1) of the Transient Host Exchange (THEx) Dataset. Paper describing the dataset: “Linking Extragalactic Transients and their Host Galaxy Properties: Transient Sample, Multi-Wavelength Host Identification, and Database Construction” (Qin et al. 2021). The data release contains four compressed archives.

    “BSON export” is a binary export of the “host_summary” collection, which is the “full version” of the dataset. The schema is presented in the Appendix section of the paper. You need to set up a MongoDB server to use this version of the dataset. After setting up the server, you may import this BSON file into your local database as a collection using the “mongorestore” command. Useful tutorials for setting up the server and importing BSON files into your local database: https://docs.mongodb.com/manual/installation/ and https://www.mongodb.com/basics/bson. Once you import this BSON snapshot into your local database, you may run common operations like queries and aggregations; an official tutorial can be found at https://docs.mongodb.com/manual/tutorial/query-documents/. There are also packages (e.g., pymongo for Python) and other software for performing these database operations.

    “JSON export” is a compressed archive of JSON files. Each file, named by the unique id and the preferred name of the event, contains the complete host data of a single event. The data schema and contents are identical to the “BSON” version.

    “NumPy export” contains a series of NumPy tables in “npy” format, with a row-to-row correspondence across these files. Except for the “master table” (THEx-v8.0-release-assembled.npy), which contains all the columns, each file contains the host properties cross-matched in a single external catalog. The meta info and ancillary data are summarized in THEx-v8.0-release-assembled-index.npy. There is also a THEx-v8.0-release-typerowmask.npy file, which has rows co-indexed with the other files and columns named after each transient type. The “rowmask” file allows you to select the subset of events under a specific transient type. Note that in this version, we only include cataloged properties of the confirmed hosts or primary candidates. If the confirmed host (or primary candidate) cross-matched multiple sources in a specific catalog, we only use the representative source for host properties; properties of other cross-matched groups are not included. Finally, the table THEx-v8.0-release-MWExt.npy contains the calculated foreground extinction (in magnitudes) at host positions. These extinction values have not been applied to the magnitude columns in the dataset; you need to perform this correction yourself if desired.

    “FITS export” includes the same individual tables as “NumPy export”. However, the FITS standard limits the number of columns in a table, so the “master table” is not included in “FITS export.”

    Finally, in the BSON and JSON versions, cross-matched groups (under the “groups” key) are ordered by the default ranking function. Even if the first group in this list (namely, the confirmed host or primary host candidate) is mismatched or misidentified, we keep it in its original position. The results of visual inspection, including our manual reassignments, are summarized under the “vis_insp” key. For the NumPy and FITS versions, if we have manually reassigned the host of an event, the data presented in these tables are updated accordingly. You may use the “case_code” column in the “index” file to find the result of visual inspection and manual reassignment; the flags for this “case_code” column are summarized in case-code.txt. Generally, codes “A1” and “F1” are known and new hosts that passed our visual inspection, while codes “B1” and “G1” are mismatched known hosts and possibly misidentified new hosts that have been manually reassigned.
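    The rowmask selection can be sketched with synthetic NumPy structured arrays (the field names and values below are invented; the real files define one column per transient type, co-indexed row-for-row with the master table):

```python
import numpy as np

# Hypothetical stand-ins for THEx-v8.0-release-assembled.npy and
# THEx-v8.0-release-typerowmask.npy; real field names and dtypes differ.
master = np.array(
    [("SN 2011fe", 9.8), ("AT 2018cow", 19.1), ("SN 1987A", 4.5)],
    dtype=[("name", "U16"), ("host_mag", "f8")],
)
rowmask = np.array(
    [(1, 0), (0, 1), (1, 0)],
    dtype=[("SN_Ia", "i1"), ("TDE", "i1")],
)

# Rows are co-indexed, so a boolean mask built from one type column of
# the rowmask table selects the matching rows of the master table.
ia_events = master[rowmask["SN_Ia"] == 1]
names = [str(n) for n in ia_events["name"]]
print(names)  # ['SN 2011fe', 'SN 1987A']
```

    With the real files, load each table via np.load() and index the same way.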

  4. Non Relational Databases Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 16, 2024
    Cite
    Dataintelo (2024). Non Relational Databases Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/non-relational-databases-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Non Relational Databases Market Outlook



    The global market size for non-relational databases is expected to grow from USD 10.5 billion in 2023 to USD 35.2 billion by 2032, registering a Compound Annual Growth Rate (CAGR) of 14.6% over the forecast period. This substantial growth is primarily driven by increasing demand for scalable, flexible database solutions capable of handling diverse data types and large volumes of data generated across various industries.



    One of the significant growth factors for the non-relational databases market is the exponential increase in data generated globally. With the proliferation of Internet of Things (IoT) devices, social media platforms, and digital transactions, the volume of semi-structured and unstructured data is growing at an unprecedented rate. Traditional relational databases often fall short in efficiently managing such data types, making non-relational databases a preferred choice. For example, document-oriented databases like MongoDB allow for the storage of JSON-like documents, offering flexibility in data modeling and retrieval.



    Another key driver is the increasing adoption of non-relational databases among enterprises seeking agile and scalable database solutions. The need for high-performance applications that can scale horizontally and handle large volumes of transactions is pushing businesses to shift from traditional relational databases to non-relational databases. This is particularly evident in sectors like e-commerce, where the ability to manage customer data, product catalogs, and transaction histories in real-time is crucial. Additionally, companies in the BFSI (Banking, Financial Services, and Insurance) sector are leveraging non-relational databases for fraud detection, risk management, and customer relationship management.



    The advent of cloud computing and the growing trend of digital transformation are also significant contributors to the market growth. Cloud-based non-relational databases offer numerous advantages, including reduced infrastructure costs, scalability, and ease of access. As more organizations migrate their operations to the cloud, the demand for cloud-based non-relational databases is set to rise. Moreover, the availability of Database-as-a-Service (DBaaS) offerings from major cloud providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) is simplifying the deployment and management of these databases, further driving their adoption.



    Regionally, North America holds the largest market share, driven by the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The rapid digitalization, growing adoption of cloud services, and increasing investments in IT infrastructure in countries like China and India are propelling the demand for non-relational databases in the region. Additionally, the expanding e-commerce sector and the proliferation of smart devices are further boosting market growth in Asia Pacific.



    Type Analysis



    The non-relational databases market is segmented into several types, including Document-Oriented Databases, Key-Value Stores, Column-Family Stores, Graph Databases, and Others. Each type offers unique functionalities and caters to specific use cases, making them suitable for different industry requirements. Document-Oriented Databases, such as MongoDB and CouchDB, store data in document format (e.g., JSON or BSON), allowing for flexible schema designs and efficient data retrieval. These databases are widely used in content management systems, e-commerce platforms, and real-time analytics applications due to their ability to handle semi-structured data.



    Key-Value Stores, such as Redis and Amazon DynamoDB, store data as key-value pairs, providing extremely fast read and write operations. These databases are ideal for caching, session management, and real-time applications where speed is critical. They offer horizontal scalability and are highly efficient in managing large volumes of data with simple query requirements. The simplicity of the key-value data model and its performance benefits make it a popular choice for high-throughput applications.



    Column-Family Stores, such as Apache Cassandra and HBase, store data in columns rather than rows, allowing for efficient storage and retrieval of large datasets. These databases are designed to handle massive amounts of data across distributed systems, making them suitable for use cases involving big data analytics, time-seri

  5. A one percent sample of German Twitter retweet traffic

    • zenodo.org
    Updated Mar 8, 2023
    Cite
    Nane Kratzke (2023). A one percent sample of German Twitter retweet traffic [Dataset]. http://doi.org/10.5281/zenodo.7669923
    Explore at:
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nane Kratzke
    Description

    This dataset includes a one percent sample of German-language Twitter retweets in Twitter raw data format. For each day, all retweets are stored in JSON format (one document per line).

    The dataset was recorded using Tweepy and exported from a MongoDB database. It is intended to be imported into a MongoDB database to run analytical queries. It is not intended to be processed as is.
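    Because each line is a standalone JSON document, a day's file can also be inspected directly in Python before (or instead of) a MongoDB import. A sketch with invented records (the real data uses Twitter's raw v1.1-style fields):

```python
import io
import json

# Two fake retweet records standing in for one day's file; real records
# carry the full Twitter raw payload, including a retweeted_status object.
day_file = io.StringIO(
    '{"id": 1, "lang": "de", "retweeted_status": {"id": 100}}\n'
    '{"id": 2, "lang": "de", "retweeted_status": {"id": 101}}\n'
)
retweets = [json.loads(line) for line in day_file]
print(len(retweets))  # 2
```

    For the full dataset, mongoimport (from the MongoDB Database Tools) handles this one-document-per-line format natively.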

    The dataset covers 60 consecutive days and ends on 01/25/2023.

    The dataset was recorded as part of this study.

    Kratzke, N. How to Find Orchestrated Trolls? A Case Study on Identifying Polarized Twitter Echo Chambers. Computers 2023, 12, 57. https://doi.org/10.3390/computers12030057

  6. Data Base Management Systems market size was USD 50.5 billion in 2022 !

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Apr 7, 2025
    Cite
    Cognitive Market Research (2025). Data Base Management Systems market size was USD 50.5 billion in 2022 ! [Dataset]. https://www.cognitivemarketresearch.com/data-base-management-systems-market-report
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Apr 7, 2025
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    The global Data Base Management Systems market was valued at USD 50.5 billion in 2022 and is projected to reach USD 120.6 billion by 2030, registering a CAGR of 11.5% over the forecast period 2023-2030.

    Factors Affecting Data Base Management Systems Market Growth

    Growing inclination of organizations towards adoption of advanced technologies like cloud-based technology favours the growth of global DBMS market
    

    Cloud-based database management system solutions give organizations the ability to scale their database infrastructure up or down as required. In a dynamic business environment, data volume can vary over time; the cloud allows organizations to allocate resources dynamically and systematically, ensuring optimal performance without underutilization. These solutions are also cost-efficient: they eliminate the need for companies to maintain and invest in physical infrastructure and hardware, reducing both ongoing operational costs and upfront capital expenditures. Organizations can choose pay-as-you-go pricing models, paying only for the resources they consume, which makes the cloud an economical option for smaller businesses and large enterprises alike. Moreover, cloud-based DBMS platforms usually come with management tools that streamline administrative tasks such as backup, provisioning, recovery, and monitoring, allowing IT teams to concentrate on strategic work rather than routine maintenance and thereby enhancing operational efficiency. Cloud-based systems also enable remote access and collaboration among teams regardless of physical location, which suits today's distributed and remote workforces: authorized personnel can access and update data in real time, supporting collaboration and better decision-making. Owing to all these factors, the rising adoption of advanced technologies like cloud-based DBMS favours market growth.

    Availability of open-source solutions is likely to restrain the global data base management systems market growth
    

    Open-source database management system solutions such as PostgreSQL, MongoDB, and MySQL offer strong functionality at minimal or no licensing cost, making them an attractive option for companies, especially start-ups and smaller businesses with limited budgets. Because these open-source solutions offer capabilities similar to many commercial DBMS offerings, organizations may opt for them to save costs. Open-source solutions also benefit from active developer communities that contribute to their development, enhancement, and maintenance; this collaborative environment supports continuous innovation and improvement, resulting in solutions that are competitive with commercial offerings in terms of performance and features. While open-source solutions create competition for the commercial DBMS market, commercial vendors continue to thrive by offering unique value propositions and by addressing the needs of organizations that prioritize professional support, seamless integration into complex IT ecosystems, and advanced features.

    Introduction of Data Base Management Systems

    A Database Management System (DBMS) is software specifically designed to organize and manage data in a structured manner. It allows users to create, modify, and query a database, and to manage the security and access controls for that database. A DBMS offers tools for creating and modifying data models, which define the structure and relationships of the data in a database. It is also responsible for storing and retrieving data from the database, and it provides several methods for searching and querying the data. A DBMS offers mechanisms to control concurrent access to the database, ensuring that multiple users can access the data safely at the same time. It provides tools to enforce security constraints and data integrity, such as constraints on data values and access controls that restrict who can access the data. Finally, a DBMS provides mechanisms for backing up and recovering data when a system failure occurs.

  7. US Job Postings from 2023-05-05

    • kaggle.com
    Updated May 10, 2023
    Cite
    Techmap.io (2023). US Job Postings from 2023-05-05 [Dataset]. https://www.kaggle.com/datasets/techmap/us-job-postings-from-2023-05-05
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Techmap.io
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Context

    This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 33k Job Postings from the USA on May 5th 2023.

    Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.

    More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.

    We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default mongoexport writes data using one JSON document for every MongoDB document.
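    Note that mongoexport emits MongoDB Extended JSON, so types like ObjectId and Date arrive as `$oid`/`$date` wrapper objects rather than plain strings. A sketch of parsing one such line (the record below is invented, not the actual Techmap schema):

```python
import json

# One invented export line; mongoexport wraps ObjectId and Date values
# in {"$oid": ...} and {"$date": ...} objects.
line = (
    '{"_id": {"$oid": "64554d2f9f1b2c0007a1e001"}, '
    '"name": "Data Engineer", '
    '"dateScraped": {"$date": "2023-05-05T08:00:00Z"}}'
)
doc = json.loads(line)
object_id = doc["_id"]["$oid"]
print(object_id)  # 64554d2f9f1b2c0007a1e001
```

    Libraries such as pymongo's bson.json_util can decode these wrappers back into native types.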

    Inspiration

    This dataset was created to help data scientists and researchers across the world.

    License

    This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)

    Content

    Total Records Count: 33064
    Sources: 29 job boards (174 with country-portals) such as CareerBuilder, EURES, Monster, or Linkedin
    Date Range: 5 May 2023 - 5 May 2023
    File Extension: JSON

    Available Fields

    (as generated by variety.js)

    | key                     | types    | Explanation
    | ------------------------| ---------| -------------
    | _id           | ObjectId | Unique ID from the MongoDB
    | companyID        | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
    | contact         | Object  | Map/Object with contact info from the JSON, HTML or extracted from job posting
    | contact.email      | String  | Corporate email address mentioned from JSON or job posting
    | contact.phone      | String  | Corporate phone number extracted from JSON or job posting
    | dateCreated       | Date   | Date the job posting was created (or date scraped if creation date is not available)
    | dateExpired       | Date   | Date the job posting expires
    | dateScraped       | Date   | Date the job posting was scraped
    | html          | String  | The raw HTML of the job description (can be plain text for some sources)
    | idInSource       | String  | An id used in the source portal (unique for the source)
    | json          | Object  | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting and pageData1-3 source-specific JSON)
    | locale         | String  | Locale extracted from the JSON or job posting (e.g., "en_US")
    | locationID       | ObjectId | ID to a location document in our MongoDB (unique for company but not unique for jobs)
    | name          | String  | Title or Name of the job posting
    | orgAddress       | Object  | Original address data extracted from the job posting
    | orgAddress.addressLine | String  | Raw address line - mostly just a city name
    | orgAddress.city     | String  | City name from JSON, HTML or extracted from addressLine
    | orgAddress.companyName | String  | Company name from JSON, HTML or extracted from addressLine
    | orgAddress.country   | String  | Country name from JSON, HTML or extracted from addressLine
    | orgAddress.countryCode | String  | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
    | orgAddress.county    | String  | County name from JSON, HTML or extracted from addressLine
    | orgAddress.district   | String  | (City) District name from JSON, HTML or extracted from addressLine
    | orgAddress.formatted  | String  | Formatted address data extracted from the job posting
    | orgAddress.geoPoint   | Object  | Map of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lat | Number  | Latitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lng | Number  | Longitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.houseNumber | String  | House number extracted from the street or from JSON, HTML or extracted from addressLine
    | orgAddress.level    | Number  | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
    | orgAddress.postCode   | String  | Postal code / zip code extracted from JSON, HTML or addressLine
    | orgAddress.quarter   | String  | (City) Quarter name from JSON, HTML or extracted fro...
    
  8. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects and is being shared here for others to use.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each record contains an English word and its equivalent in the target language. The data was exported from a MongoDB database to ensure the uniqueness of records; each record is unique, and the records are sorted.
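    A sketch of reading such records, with invented key names (the description does not specify the exact schema):

```python
import io
import json

# Invented records: the dataset pairs an English word with its equivalent
# in a target language, one JSON record each, unique and sorted.
lines = io.StringIO(
    '{"en": "apple", "lang": "fr", "translation": "pomme"}\n'
    '{"en": "water", "lang": "fr", "translation": "eau"}\n'
)
records = [json.loads(line) for line in lines]
english_words = [r["en"] for r in records]
assert english_words == sorted(english_words)  # records arrive sorted
print(len(records))  # 2
```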

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  9. Screenshots and metadata for 214 reCAPTCHA challenges encountered between...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 19, 2024
    Ben Pettis (2024). Screenshots and metadata for 214 reCAPTCHA challenges encountered between September 2022 - September 2023 [Dataset]. http://doi.org/10.5061/dryad.h70rxwdsr
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    University of Wisconsin–Madison
    Authors
    Ben Pettis
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    In Chapter 3 of my dissertation (tentatively titled "Becoming Users: Layers of People, Technology, and Power on the Internet"), I describe how online user activities are datafied and monetized in subtle and often obfuscated ways. The chapter focuses on Google’s reCAPTCHA, a popular implementation of a CAPTCHA challenge. A CAPTCHA, or “Completely Automated Public Turing test to tell Computers and Humans Apart,” is a simple task or challenge which is intended to differentiate between genuine human users and those who may be using software or other automated means to interact maliciously with a website, such as for spam, mass data scraping, or denial-of-service attacks. reCAPTCHA challenges are increasingly hidden from the direct view of the user, instead assessing our mouse movements, browsing patterns, and other data to evaluate the likelihood that we are “authentic” users. These hidden challenges raise the stakes of understanding our own construction as Users because they obfuscate practices of surveillance and the ways that our activities as users are commodified by large corporations (Pettis, 2023). By studying the specifics of how such data collection works—that is, how we’re called upon and situated as Users—we can make more informed decisions about how we engage with the contemporary internet.

    This data set contains metadata for the 214 reCAPTCHA elements that I encountered during my personal use of the Web over the period of one year (September 2022 through September 2023). Of these reCAPTCHAs, 137 were visible challenges, meaning that there was some indication of the presence of a reCAPTCHA challenge. The remaining 77 reCAPTCHAs were entirely hidden on the page. If I had not been running my browser extension, I would likely never have been aware of the use of a reCAPTCHA on the page. The data set also includes screenshots for 174 of the reCAPTCHAs. Screenshots that contain sensitive or private information have been excluded from public access.
Researchers can request access to these additional files by contacting Ben Pettis (bpettis@wisc.edu). A browsable and searchable version of the data is also available at https://capturingcaptcha.com.

Methods

I developed a custom Google Chrome extension which detects when a page contains a reCAPTCHA and prompts the user to save a screenshot or screen recording while also collecting basic metadata. During Summer 2022, I began work on this website to collate and present the screen captures that I save throughout the year. The purpose of collecting these examples of websites where reCAPTCHAs appear is to understand how this Web element is situated within websites and presented to users, along with sketching out the frequency of their use and on what kinds of websites. Given that I will only be collecting records of my own interactions with reCAPTCHAs, this will not be a comprehensive sample that I can generalize as representative of all Web users. Though my experiences of the reCAPTCHA will differ from those of any other person, this collection will nevertheless be useful for demonstrating how the interface element may be embedded within websites and presented to users. Following Niels Brügger’s descriptions of Web history methods, these screen capture techniques provide an effective way to preserve a portion of the Web as it was actually encountered by a person, as opposed to methods such as automated scraping. Therefore, my dissertation offers a methodological contribution to Web historians by demonstrating a technique for identifying and preserving a representation of one Web element within a page, as opposed to focusing an analysis on a whole page or entire website. The browser extension is configured to store data in a cloud-based document database running in MongoDB Atlas. Any screenshots or video recordings are uploaded to a Google Cloud Storage bucket. Both the database and the cloud storage bucket are private and restricted from direct access.

The data and screenshots are viewable and searchable at https://capturingcaptcha.com. This data set represents an export of the database as of June 10, 2024. After this date, it is possible that data collection will be resumed, causing more information to be displayed on the online website. The data was exported from the database to a single JSON file (lines format) using the mongoexport command-line tool:

mongoexport --uri mongodb+srv://[database-url].mongodb.net/production --collection submissions --out captcha-out.json --username [databaseuser]
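A mongoexport file in lines format holds one JSON document per line, so it can be consumed without loading the whole export into memory. The sketch below reads such a file and tallies documents; the "url" and "visible" field names are illustrative assumptions, since the actual submission schema is not documented in this listing.

```python
import json

def read_mongoexport(path):
    """Yield one document per line from a mongoexport --out file (JSON lines)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical documents standing in for captcha-out.json records;
# "url" and "visible" are assumed field names, not the real schema.
demo_lines = [
    '{"url": "https://example.com", "visible": true}',
    '{"url": "https://example.org", "visible": false}',
]
docs = [json.loads(line) for line in demo_lines]
visible_count = sum(1 for d in docs if d.get("visible"))
print(visible_count)  # prints 1
```
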

