11 datasets found
  1. MongoDB dump (compressed)

    • figshare.com
    7z
    Updated Jun 1, 2023
    Cite
    Connor Coley (2023). MongoDB dump (compressed) [Dataset]. http://doi.org/10.6084/m9.figshare.4833482.v1
    Explore at:
    Available download formats: 7z
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Connor Coley
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This mongodump contains four collections associated with http://dx.doi.org/10.1021/acscentsci.7b00064:
    • reaction_examples/lowe_1976-2013_USPTOgrants - a collection of reaction SMILES extracted from USPTO grants by Daniel Lowe
    • reaction_examples/lowe_1976-2013_USPTOgrants_reactions - an incomplete collection of reactions extracted from USPTO grants by Daniel Lowe, containing some additional information about reagents/catalysts/solvents where known
    • askcos_transforms/lowe_refs_general_v3 - a collection of highly general reaction SMARTS strings extracted from the USPTO
    • smilesprediction/candidate_edits_8_9_16 - a collection of reaction examples with possible products enumerated, used as input for a machine learning model
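
    Since the names above are database/collection pairs, a minimal pymongo sketch for inspecting them after the dump has been restored locally with mongorestore could look as follows (the local host/port and the use of pymongo are assumptions, not part of the dataset description):

    import pymongo

    # Assumes the 7z archive has already been extracted and restored with mongorestore.
    client = pymongo.MongoClient("mongodb://localhost:27017/")

    # Database and collection names as listed in the description above.
    uspto = client["reaction_examples"]["lowe_1976-2013_USPTOgrants"]
    print(uspto.estimated_document_count())
    print(uspto.find_one())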

  2. embedded_movies

    • huggingface.co
    Updated Feb 16, 2024
    + more versions
    Cite
    MongoDB (2024). embedded_movies [Dataset]. https://huggingface.co/datasets/MongoDB/embedded_movies
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2024
    Dataset authored and provided by
    MongoDB
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    sample_mflix.embedded_movies

    This data set contains details on movies with genres of Western, Action, or Fantasy. Each document contains a single movie, and information such as its title, release year, and cast. In addition, documents in this collection include a plot_embedding field that contains embeddings created using OpenAI's text-embedding-ada-002 embedding model that you can use with the Atlas Search vector search feature.

      Overview
    

    This dataset offers a… See the full description on the dataset page: https://huggingface.co/datasets/MongoDB/embedded_movies.
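
    As a rough illustration of the plot_embedding field described above, the sketch below loads the dataset with the Hugging Face datasets library and compares two embeddings with cosine similarity; the split name "train" is an assumption, and no Atlas cluster is needed for this offline check:

    from datasets import load_dataset
    import numpy as np

    ds = load_dataset("MongoDB/embedded_movies", split="train")  # split name is an assumption

    # Some records may lack an embedding, so keep only rows that have one.
    rows = [r for r in ds.select(range(10)) if r.get("plot_embedding")]
    a = np.array(rows[0]["plot_embedding"], dtype=float)
    b = np.array(rows[1]["plot_embedding"], dtype=float)
    print(rows[0]["title"], "vs", rows[1]["title"])
    print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))  # cosine similarity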

  3. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.niaid.nih.gov
    Updated Oct 20, 2022
    + more versions
    Cite
    Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682
    Explore at:
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    University of Insubria
    KTH Royal Institute of Technology
    Foundation for Research and Technology Hellas
    Aristotle University of Thessaloniki
    Earkick
    Authors
    Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas
    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
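
    A minimal version of the pandas route just described; the file name is a placeholder for whichever daily or hourly CSV you downloaded:

    import pandas as pd

    # "lifesnaps_daily.csv" is a placeholder; substitute the actual CSV file name from the download.
    daily = pd.read_csv("lifesnaps_daily.csv")
    print(daily.shape)
    print(daily.columns.tolist())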

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed (available from the MongoDB website).

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    { _id: <ObjectId>, id (or user_id): <user id>, type: <data type>, data: <embedded object> }

    Each document consists of four fields: _id, id (also found as user_id in the sema and survey collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
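
    As a sketch of querying this document format with pymongo once the database has been restored (the database and collection names come from the text above; the type string "steps" is only an example and the exact stored value may differ):

    from pymongo import MongoClient

    fitbit = MongoClient("localhost", 27017)["rais_anonymized"]["fitbit"]

    # Count documents of one data type and inspect the embedded data object of the first match.
    print(fitbit.count_documents({"type": "steps"}))
    doc = fitbit.find_one({"type": "steps"})
    if doc:
        print(doc["id"], doc["data"])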

    Surveys Encoding

    BREQ2

    Why do you engage in exercise?

        Code          | Text
        --------------|------------------------------------------------------------------------
        engage[SQ001] | I exercise because other people say I should
        engage[SQ002] | I feel guilty when I don’t exercise
        engage[SQ003] | I value the benefits of exercise
        engage[SQ004] | I exercise because it’s fun
        engage[SQ005] | I don’t see why I should have to exercise
        engage[SQ006] | I take part in exercise because my friends/family/partner say I should
        engage[SQ007] | I feel ashamed when I miss an exercise session
        engage[SQ008] | It’s important to me to exercise regularly
        engage[SQ009] | I can’t see why I should bother exercising
        engage[SQ010] | I enjoy my exercise sessions
        engage[SQ011] | I exercise because others will not be pleased with me if I don’t
        engage[SQ012] | I don’t see the point in exercising
        engage[SQ013] | I feel like a failure when I haven’t exercised in a while
        engage[SQ014] | I think it is important to make the effort to exercise regularly
        engage[SQ015] | I find exercise a pleasurable activity
        engage[SQ016] | I feel under pressure from my friends/family to exercise
        engage[SQ017] | I get restless if I don’t exercise regularly
        engage[SQ018] | I get pleasure and satisfaction from participating in exercise
        engage[SQ019] | I think exercising is a waste of time

    PANAS

    Indicate the extent you have felt this way over the past week

        Code      | Text
        ----------|--------------
        P1[SQ001] | Interested
        P1[SQ002] | Distressed
        P1[SQ003] | Excited
        P1[SQ004] | Upset
        P1[SQ005] | Strong
        P1[SQ006] | Guilty
        P1[SQ007] | Scared
        P1[SQ008] | Hostile
        P1[SQ009] | Enthusiastic
        P1[SQ010] | Proud
        P1[SQ011] | Irritable
        P1[SQ012] | Alert
        P1[SQ013] | Ashamed
        P1[SQ014] | Inspired
        P1[SQ015] | Nervous
        P1[SQ016] | Determined
        P1[SQ017] | Attentive
        P1[SQ018] | Jittery
        P1[SQ019] | Active
        P1[SQ020] | Afraid

    Personality

    How Accurately Can You Describe Yourself?

        Code        | Text
        ------------|--------------------------------------------------------
        ipip[SQ001] | Am the life of the party.
        ipip[SQ002] | Feel little concern for others.
        ipip[SQ003] | Am always prepared.
        ipip[SQ004] | Get stressed out easily.
        ipip[SQ005] | Have a rich vocabulary.
        ipip[SQ006] | Don't talk a lot.
        ipip[SQ007] | Am interested in people.
        ipip[SQ008] | Leave my belongings around.
        ipip[SQ009] | Am relaxed most of the time.
        ipip[SQ010] | Have difficulty understanding abstract ideas.
        ipip[SQ011] | Feel comfortable around people.
        ipip[SQ012] | Insult people.
        ipip[SQ013] | Pay attention to details.
        ipip[SQ014] | Worry about things.
        ipip[SQ015] | Have a vivid imagination.
        ipip[SQ016] | Keep in the background.
        ipip[SQ017] | Sympathize with others' feelings.
        ipip[SQ018] | Make a mess of things.
        ipip[SQ019] | Seldom feel blue.
        ipip[SQ020] | Am not interested in abstract ideas.
        ipip[SQ021] | Start conversations.
        ipip[SQ022] | Am not interested in other people's problems.
        ipip[SQ023] | Get chores done right away.
        ipip[SQ024] | Am easily disturbed.
        ipip[SQ025] | Have excellent ideas.
        ipip[SQ026] | Have little to say.
        ipip[SQ027] | Have a soft heart.
        ipip[SQ028] | Often forget to put things back in their proper place.
        ipip[SQ029] | Get upset easily.
        ipip[SQ030] | Do not have a good imagination.
        ipip[SQ031] | Talk to a lot of different people at parties.
        ipip[SQ032] | Am not really interested in others.
        ipip[SQ033] | Like order.
        ipip[SQ034] | Change my mood a lot.
        ipip[SQ035] | Am quick to understand things.
        ipip[SQ036] | Don't like to draw attention to myself.
        ipip[SQ037] | Take time out for others.
        ipip[SQ038] | Shirk my duties.
        ipip[SQ039] | Have frequent mood swings.
        ipip[SQ040] | Use difficult words.
        ipip[SQ041] | Don't mind being the centre of attention.
        ipip[SQ042] | Feel others' emotions.
        ipip[SQ043] | Follow a schedule.
        ipip[SQ044] | Get irritated easily.
        ipip[SQ045] | Spend time reflecting on things.
        ipip[SQ046] | Am quiet around strangers.
        ipip[SQ047] | Make people feel at ease.
        ipip[SQ048] | Am exacting in my work.
        ipip[SQ049] | Often feel blue.
        ipip[SQ050] | Am full of ideas.

    STAI

    Indicate how you feel right now

        Code        | Text
        ------------|----------------------------------------------------
        STAI[SQ001] | I feel calm
        STAI[SQ002] | I feel secure
        STAI[SQ003] | I am tense
        STAI[SQ004] | I feel strained
        STAI[SQ005] | I feel at ease
        STAI[SQ006] | I feel upset
        STAI[SQ007] | I am presently worrying over possible misfortunes
        STAI[SQ008] | I feel satisfied
        STAI[SQ009] | I feel frightened
        STAI[SQ010] | I feel comfortable
        STAI[SQ011] | I feel self-confident
        STAI[SQ012] | I feel nervous
        STAI[SQ013] | I am jittery
        STAI[SQ014] | I feel indecisive
        STAI[SQ015] | I am relaxed
        STAI[SQ016] | I feel content
        STAI[SQ017] | I am worried
        STAI[SQ018] | I feel confused
        STAI[SQ019] | I feel steady
        STAI[SQ020] | I feel pleasant

    TTM

    Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

        Code             | Text
        -----------------|-----------------------------------------------
        processes[SQ002] | I read articles to learn more about physical
    
  4. NoSQL Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    + more versions
    Cite
    Dataintelo (2025). NoSQL Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-nosql-software-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    NoSQL Software Market Outlook



    The global NoSQL software market size was valued at approximately USD 6 billion in 2023 and is projected to reach around USD 20 billion by 2032, growing at a compound annual growth rate (CAGR) of 14% during the forecast period. This market is driven by the escalating need for operational efficiency, flexibility, and scalability in database management systems, particularly in enterprises dealing with vast amounts of unstructured data.



    One of the primary growth factors propelling the NoSQL software market is the exponential increase in data volumes generated by various digital platforms, IoT devices, and social media. Traditional relational databases often struggle to handle this surge efficiently, prompting organizations to shift towards NoSQL databases that offer more flexibility and scalability. The ability to store and process large sets of unstructured data without needing a predefined schema makes NoSQL databases an attractive choice for modern businesses seeking agility and speed in data management.



    Moreover, the proliferation of cloud computing services has significantly contributed to the growth of the NoSQL software market. Cloud-based NoSQL databases provide cost-effective, scalable, and easily accessible solutions for enterprises of all sizes. The pay-as-you-go pricing model and the capacity to scale resources based on demand have made NoSQL databases a preferred option for startups and large enterprises alike. The seamless integration of NoSQL databases with cloud infrastructure enhances operational efficiencies and reduces the complexities associated with database management.



    Another critical driver is the increasing adoption of NoSQL databases in various industry verticals such as retail, BFSI, IT, and healthcare. These industries require robust data management solutions to handle large volumes of diverse data types. NoSQL databases, with their flexible data models and high performance, cater to these requirements efficiently. In the retail sector, for example, NoSQL databases are used to manage customer data, product catalogs, and transaction histories, enabling more personalized and efficient customer services.



    Regionally, North America holds a significant share of the NoSQL software market due to the presence of major technology companies and a mature IT infrastructure. The rapid digital transformation across enterprises in the region, alongside substantial investments in big data analytics and cloud computing, further fuels market growth. Additionally, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, driven by the expanding IT sector, increased adoption of cloud services, and significant investments in digital technologies in countries like China and India.



    Graph Databases Software has emerged as a crucial component in the landscape of NoSQL databases, particularly for applications that require understanding complex relationships between data entities. Unlike traditional databases that store data in tables, graph databases use nodes, edges, and properties to represent and store data, making them ideal for scenarios where relationships are as important as the data itself. This approach is particularly beneficial in fields such as social networking, where the ability to analyze connections between users can provide deep insights into social dynamics and influence patterns. As businesses increasingly seek to leverage data for competitive advantage, the demand for graph databases is expected to grow, driven by their ability to efficiently model and query interconnected data.



    Type Analysis



    The NoSQL software market is segmented into various types, including Document-Oriented, Key-Value Store, Column-Oriented, and Graph-Based databases. Document-oriented databases, such as MongoDB, store data in JSON-like documents, offering flexibility in data modeling and ease of use. These databases are widely used for content management systems, e-commerce applications, and real-time analytics. Their ability to handle semi-structured data and scalability features make them a popular choice among developers and enterprises seeking agile database solutions.



    Key-Value Store databases, such as Redis and Amazon DynamoDB, store data as a collection of key-value pairs, providing ultra-fast read and write operations. These databases are ideal for applications requiring high-speed data retrieval, such as caching, session manag

  5. A one percent sample of German Twitter retweet traffic

    • zenodo.org
    Updated Mar 8, 2023
    Cite
    Nane Kratzke (2023). A one percent sample of German Twitter retweet traffic [Dataset]. http://doi.org/10.5281/zenodo.7669923
    Explore at:
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nane Kratzke
    Description

    This dataset includes a one percent sample of German-language Twitter retweets in Twitter raw data format. For each day, all retweets are stored in JSON format (one entry per line).

    The dataset was recorded using Tweepy and exported from a MongoDB database. It is intended to be imported into a MongoDB database to run analytical queries. It is not intended to be processed as is.

    The dataset covers 60 consecutive days and ends on 01/25/2023.
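
    Since the files hold one JSON entry per line and are meant to be loaded into MongoDB, a minimal import sketch with pymongo could look like this (the file, database, and collection names are placeholders; mongoimport would work equally well):

    import json
    from pymongo import MongoClient

    # Placeholder names; adjust to your local setup and to the daily file you want to load.
    col = MongoClient("localhost", 27017)["retweets"]["german_sample"]

    with open("retweets_day_01.json", encoding="utf-8") as fh:
        batch = [json.loads(line) for line in fh if line.strip()]

    col.insert_many(batch)
    print(col.estimated_document_count())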

    The dataset was recorded as part of this study.

    Kratzke, N. How to Find Orchestrated Trolls? A Case Study on Identifying Polarized Twitter Echo Chambers. Computers 2023, 12, 57. https://doi.org/10.3390/computers12030057

  6. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jul 12, 2022
    Cite
    Zenodo (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6832242?locale=fr
    Explore at:
    Available download formats: unknown (642961582)
    Dataset updated
    Jul 12, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed (available from the MongoDB website). For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

  7. Job Postings from Ireland (October 2021)

    • kaggle.com
    zip
    Updated Apr 16, 2023
    Cite
    Techmap.io (2023). Job Postings from Ireland (October 2021) [Dataset]. https://www.kaggle.com/datasets/techmap/job-postings-ireland-october-2021
    Explore at:
    Available download formats: zip (56469415 bytes)
    Dataset updated
    Apr 16, 2023
    Authors
    Techmap.io
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Ireland
    Description

    Context

    This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 24621 Job Postings from Ireland in October 2021.

    Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.

    More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.

    We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default, mongoexport writes one JSON document per line, one for every MongoDB document.
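
    Because mongoexport emits one JSON document per line, the export can be read directly with pandas; the file name below is a placeholder, and the column names are taken from the Available Fields table further down:

    import pandas as pd

    # Placeholder file name for the JSON export contained in the download.
    jobs = pd.read_json("ireland_jobs_2021-10.json", lines=True)
    print(len(jobs))
    print(jobs[["name", "dateCreated"]].head())  # field names from the Available Fields table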

    Inspiration

    This dataset was created to help data scientists and researchers across the world.

    License

    This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)

    Available Fields

    (as generated by variety.js)

    +----------------------------------------------------
    | key           | types   | Explanation
    | ------------------------| ----------| -------------
    | _id           | ObjectId | Unique ID from the MongoDB
    | companyID        | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
    | contact         | Object  | Map/Object with contact info from the JSON, HTML or extracted from job posting
    | contact.email      | String  | Corporate email address mentioned in the JSON or job posting
    | contact.phone      | String  | Corporate phone number extracted from the JSON or job posting
    | dateCreated       | Date   | Date the job posting was created (or date scraped if creation date is not available)
    | dateExpired       | Date   | Date the job posting expires
    | dateScraped       | Date   | Date the job posting was scraped
    | html          | String  | The raw HTML of the job description (can be plain text for some sources)
    | idInSource       | String  | An id used in the source portal (unique for the source)
    | json          | Object  | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting and pageData1-3 source-specific json)
    | locale         | String  | Locale extracted from the JSON or job posting (e.g., "en_US")
    | locationID       | ObjectId | ID to a location document in our MongoDB (unique for company but not unique for jobs)
    | name          | String  | Title or Name of the job posting
    | orgAddress       | Object  | Original address data extracted from the job posting
    | orgAddress.addressLine | String  | Raw address line - mostly just a city name
    | orgAddress.city     | String  | City name from JSON, HTML or extracted from addressLine
    | orgAddress.companyName | String  | Company name from JSON, HTML or extracted from addressLine
    | orgAddress.country   | String  | Country name from JSON, HTML or extracted from addressLine
    | orgAddress.countryCode | String  | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
    | orgAddress.county    | String  | County name from JSON, HTML or extracted from addressLine
    | orgAddress.district   | String  | (City) District name from JSON, HTML or extracted from addressLine
    | orgAddress.formatted  | String  | Formatted address data extracted from the job posting
    | orgAddress.geoPoint   | Object  | Map of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lat | Number  | Latitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lng | Number  | Longitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.houseNumber | String  | House number extracted from the street or from JSON, HTML or extracted from addressLine
    | orgAddress.level    | Number  | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
    | orgAddress.postCode   | String  | Postal code / zip code extracted from JSON, HTML or addressLine
    | orgAddress.quarter   | String  | (City) Quarter name from JSON, HTML or extracted from addressLine
    | orgAddress.state    | String  | State name or abbreviation from JSON, HTML or extracted from addressLine
    | orgAddress.street    | String  | Street name (and maybe housen...
    
  8. Screenshots and metadata for 214 reCAPTCHA challenges encountered between...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Jun 19, 2024
    Cite
    Ben Pettis (2024). Screenshots and metadata for 214 reCAPTCHA challenges encountered between September 2022 - September 2023 [Dataset]. http://doi.org/10.5061/dryad.h70rxwdsr
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    University of Wisconsin–Madison
    Authors
    Ben Pettis
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    In Chapter 3 of my dissertation (tentatively titled "Becoming Users: Layers of People, Technology, and Power on the Internet"), I describe how online user activities are datafied and monetized in subtle and often obfuscated ways. The chapter focuses on Google’s reCAPTCHA, a popular implementation of a CAPTCHA challenge. A CAPTCHA, or “Completely Automated Public Turing test to tell Computers and Humans Apart,” is a simple task or challenge intended to differentiate between genuine human users and those who may be using software or other automated means to interact maliciously with a website, such as for spam, mass data scraping, or denial of service attacks. reCAPTCHA challenges are increasingly being hidden from direct view of the user, instead assessing our mouse movements, browsing patterns, and other data to evaluate the likelihood that we are “authentic” users. These hidden challenges raise the stakes of understanding our own construction as Users because they obfuscate practices of surveillance and the ways that our activities as users are commodified by large corporations (Pettis, 2023). By studying the specifics of how such data collection works—that is, how we’re called upon and situated as Users—we can make more informed decisions about how we engage with the contemporary internet.

    This data set contains metadata for the 214 reCAPTCHA elements that I encountered during my personal use of the Web for the period of one year (September 2022 through September 2023). Of these reCAPTCHAs, 137 were visible challenges, meaning that there was some indication of the presence of a reCAPTCHA challenge. The remaining 77 reCAPTCHAs were entirely hidden on the page; if I had not been running my browser extension, I would likely never have been aware of the use of a reCAPTCHA on the page. The data set also includes screenshots for 174 of the reCAPTCHAs. Screenshots that contain sensitive or private information have been excluded from public access. Researchers can request access to these additional files by contacting Ben Pettis (bpettis@wisc.edu). A browsable and searchable version of the data is also available at https://capturingcaptcha.com.

    Methods

    I developed a custom Google Chrome extension which detects when a page contains a reCAPTCHA and prompts the user to save a screenshot or screen recording while also collecting basic metadata. During Summer 2022, I began work on this website to collate and present the screen captures that I save throughout the year. The purpose of collecting these examples of websites where reCAPTCHAs appear is to understand how this Web element is situated within websites and presented to users, along with sketching out the frequency of their use and on what kinds of websites. Given that I will only be collecting records of my own interactions with reCAPTCHAs, this will not be a comprehensive sample that I can generalize as representative of all Web users. Though my experiences of the reCAPTCHA will differ from those of any other person, this collection will nevertheless be useful for demonstrating how the interface element may be embedded within websites and presented to users. Following Niels Brügger’s descriptions of Web history methods, these screen capture techniques provide an effective way to preserve a portion of the Web as it was actually encountered by a person, as opposed to methods such as automated scraping.
    Therefore my dissertation offers a methodological contribution to Web historians by demonstrating a technique for identifying and preserving a representation of one Web element within a page, as opposed to focusing an analysis on a whole page or entire website.

    The browser extension is configured to store data in a cloud-based document database running in MongoDB Atlas. Any screenshots or video recordings are uploaded to a Google Cloud Storage bucket. Both the database and cloud storage bucket are private and are restricted from direct access. The data and screenshots are viewable and searchable at https://capturingcaptcha.com. This data set represents an export of the database as of June 10, 2024. After this date, it is possible that data collection will be resumed, causing more information to be displayed in the online website. The data was exported from the database to a single JSON file (lines format) using the mongoexport command line tool:

    mongoexport --uri mongodb+srv://[database-url].mongodb.net/production --collection submissions --out captcha-out.json --username [databaseuser]
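
    Since the export is a single JSON file in lines format, the metadata can be inspected with nothing more than the Python standard library; the file name matches the --out value in the command above:

    import json

    # One JSON object per line, as produced by the mongoexport command above.
    with open("captcha-out.json", encoding="utf-8") as fh:
        submissions = [json.loads(line) for line in fh if line.strip()]

    print(len(submissions))
    print(sorted(submissions[0].keys()))  # inspect the metadata fields of one record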

  9. US Job Postings from 2023-05-05

    • kaggle.com
    zip
    Updated May 10, 2023
    Cite
    Techmap.io (2023). US Job Postings from 2023-05-05 [Dataset]. https://www.kaggle.com/datasets/techmap/us-job-postings-from-2023-05-05/discussion?sort=undefined
    Explore at:
    Available download formats: zip (805159819 bytes)
    Dataset updated
    May 10, 2023
    Authors
    Techmap.io
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Context

    This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 33k Job Postings from the USA on May 5th 2023.

    Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.

    More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.

    We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default, mongoexport writes one JSON document per line, one for every MongoDB document.
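
    As with the other Techmap exports, the file holds one JSON document per line; a small sketch that tallies postings by state (placeholder file name, field path taken from the Available Fields table below):

    import json
    from collections import Counter

    states = Counter()
    # Placeholder file name for the JSON export contained in the download.
    with open("us_jobs_2023-05-05.json", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            doc = json.loads(line)
            states[(doc.get("orgAddress") or {}).get("state")] += 1

    print(states.most_common(10))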

    Inspiration

    This dataset was created to help data scientists and researchers across the world.

    License

    This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)

    Content

    Total Records Count: 33064
    Sources: 29 job boards (174 with country-portals) such as CareerBuilder, EURES, Monster, or LinkedIn
    Date Range: 5 May 2023 - 5 May 2023
    File Extension: JSON

    Available Fields

    (as generated by variety.js)

    +----------------------------------------------------
    | key           | types   | Explanation
    | ------------------------| ----------| -------------
    | _id           | ObjectId | Unique ID from the MongoDB
    | companyID        | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
    | contact         | Object  | Map/Object with contact info from the JSON, HTML or extracted from job posting
    | contact.email      | String  | Corporate email address mentioned in the JSON or job posting
    | contact.phone      | String  | Corporate phone number extracted from the JSON or job posting
    | dateCreated       | Date   | Date the job posting was created (or date scraped if creation date is not available)
    | dateExpired       | Date   | Date the job posting expires
    | dateScraped       | Date   | Date the job posting was scraped
    | html          | String  | The raw HTML of the job description (can be plain text for some sources)
    | idInSource       | String  | An id used in the source portal (unique for the source)
    | json          | Object  | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting and pageData1-3 source-specific json)
    | locale         | String  | Locale extracted from the JSON or job posting (e.g., "en_US")
    | locationID       | ObjectId | ID to a location document in our MongoDB (unique for company but not unique for jobs)
    | name          | String  | Title or Name of the job posting
    | orgAddress       | Object  | Original address data extracted from the job posting
    | orgAddress.addressLine | String  | Raw address line - mostly just a city name
    | orgAddress.city     | String  | City name from JSON, HTML or extracted from addressLine
    | orgAddress.companyName | String  | Company name from JSON, HTML or extracted from addressLine
    | orgAddress.country   | String  | Country name from JSON, HTML or extracted from addressLine
    | orgAddress.countryCode | String  | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
    | orgAddress.county    | String  | County name from JSON, HTML or extracted from addressLine
    | orgAddress.district   | String  | (City) District name from JSON, HTML or extracted from addressLine
    | orgAddress.formatted  | String  | Formatted address data extracted from the job posting
    | orgAddress.geoPoint   | Object  | Map of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lat | Number  | Latitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lng | Number  | Longitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.houseNumber | String  | House number extracted from the street or from JSON, HTML or extracted from addressLine
    | orgAddress.level    | Number  | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
    | orgAddress.postCode   | String  | Postal code / zip code extracted from JSON, HTML or addressLine
    | orgAddress.quarter   | String  | (City) Quarter name from JSON, HTML or extracted fro...
    
  10. 785 Million Language Translation Database for AI

    • kaggle.com
    zip
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Available download formats: zip (6504894854 bytes)
    Dataset updated
    Aug 28, 2023
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB (uncompressed), 20 GB (compressed)

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each record contains an English word and its equivalent translation. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique and sorted.
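
    The exact record layout is not documented here, so the following reader is only a sketch: it assumes one JSON object per line (typical for a mongoexport) and uses purely hypothetical field names that would need to be adjusted to the actual keys in the files:

    import json

    # Field names "en", "lang", and "translation" are hypothetical placeholders.
    with open("translations_sample.json", encoding="utf-8") as fh:  # placeholder file name
        for i, line in enumerate(fh):
            record = json.loads(line)
            print(record.get("en"), "->", record.get("lang"), record.get("translation"))
            if i >= 4:
                break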

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  11. Inventory data for Pharmacy Website in JSON format

    • kaggle.com
    zip
    Updated Oct 22, 2024
    Cite
    Priti Poddar (2024). Inventory data for Pharmacy Website in JSON format [Dataset]. https://www.kaggle.com/datasets/pritipoddar/inventory-data-for-pharmacy-website-in-json-format
    Explore at:
    Available download formats: zip (14761 bytes)
    Dataset updated
    Oct 22, 2024
    Authors
    Priti Poddar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset contains inventory data for a pharmacy e-commerce website in JSON format, designed for easy integration into MongoDB databases, making it ideal for MERN stack projects. It includes 10 fields:

    • drugName: Name of the drug
    • manufacturer: Drug manufacturer
    • image: URL of the product image
    • description: Detailed description of the drug
    • expiryDate: Expiry date of the drug
    • price: Price of the drug
    • sideEffects: Potential side effects
    • disclaimer: Important legal and medical disclaimers
    • category: Drug classification (e.g., pain relief, antibiotics)
    • countInStock: Quantity of the product available in stock

    This dataset is useful for developing pharmacy-related web applications, inventory management systems, or online medical stores using the MERN stack.
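
    A minimal sketch of loading the file into a local MongoDB collection for a MERN-style project; the database, collection, and file names are placeholders, while the field names come from the list above:

    import json
    from pymongo import MongoClient

    col = MongoClient("localhost", 27017)["pharmacy"]["inventory"]  # placeholder names

    # Assumes the JSON file is a single array of product documents; adjust if it is one document per line.
    with open("inventory.json", encoding="utf-8") as fh:
        items = json.load(fh)

    col.insert_many(items)
    print(col.find_one({}, {"drugName": 1, "price": 1, "countInStock": 1}))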

    Do not use for production-level purposes; use for project development only. Feel free to contribute if you find any mistakes or have suggestions.
