11 datasets found
  1. MongoDB dump (compressed)

    • figshare.com
    7z
    Updated Jun 1, 2023
    Cite
    Connor Coley (2023). MongoDB dump (compressed) [Dataset]. http://doi.org/10.6084/m9.figshare.4833482.v1
    Explore at:
    Available download formats: 7z
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Connor Coley
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This mongodump contains four collections associated with http://dx.doi.org/10.1021/acscentsci.7b00064:
    • reaction_examples/lowe_1976-2013_USPTOgrants - a collection of reaction SMILES extracted from USPTO grants by Daniel Lowe
    • reaction_examples/lowe_1976-2013_USPTOgrants_reactions - an incomplete collection of reactions extracted from USPTO grants by Daniel Lowe, containing some additional information about reagents/catalysts/solvents where known
    • askcos_transforms/lowe_refs_general_v3 - a collection of highly general reaction SMARTS strings extracted from the USPTO
    • smilesprediction/candidate_edits_8_9_16 - a collection of reaction examples with possible products enumerated, used as input for a machine learning model
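
    Since the names above are database/collection pairs, a minimal pymongo sketch for inspecting them after the dump has been restored locally with mongorestore could look as follows (the local host/port and the use of pymongo are assumptions, not part of the dataset description):

    import pymongo

    # Assumes the 7z archive has already been extracted and restored with mongorestore.
    client = pymongo.MongoClient("mongodb://localhost:27017/")

    # Database and collection names as listed in the description above.
    uspto = client["reaction_examples"]["lowe_1976-2013_USPTOgrants"]
    print(uspto.estimated_document_count())
    print(uspto.find_one())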

  2. embedded_movies

    • huggingface.co
    Updated Feb 16, 2024
    + more versions
    Cite
    MongoDB (2024). embedded_movies [Dataset]. https://huggingface.co/datasets/MongoDB/embedded_movies
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2024
    Dataset authored and provided by
    MongoDB
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    sample_mflix.embedded_movies

    This data set contains details on movies with genres of Western, Action, or Fantasy. Each document contains a single movie, and information such as its title, release year, and cast. In addition, documents in this collection include a plot_embedding field that contains embeddings created using OpenAI's text-embedding-ada-002 embedding model that you can use with the Atlas Search vector search feature.

      Overview
    

    This dataset offers a… See the full description on the dataset page: https://huggingface.co/datasets/MongoDB/embedded_movies.
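
    As a rough illustration of the plot_embedding field described above, the sketch below loads the dataset with the Hugging Face datasets library and compares two embeddings with cosine similarity; the split name "train" is an assumption, and no Atlas cluster is needed for this offline check:

    from datasets import load_dataset
    import numpy as np

    ds = load_dataset("MongoDB/embedded_movies", split="train")  # split name is an assumption

    # Some records may lack an embedding, so keep only rows that have one.
    rows = [r for r in ds.select(range(10)) if r.get("plot_embedding")]
    a = np.array(rows[0]["plot_embedding"], dtype=float)
    b = np.array(rows[1]["plot_embedding"], dtype=float)
    print(rows[0]["title"], "vs", rows[1]["title"])
    print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))  # cosine similarity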

  3. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.niaid.nih.gov
    Updated Oct 20, 2022
    + more versions
    Cite
    Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6826682
    Explore at:
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    University of Insubria
    KTH Royal Institute of Technology
    Foundation for Research and Technology Hellas
    Aristotle University of Thessaloniki
    Earkick
    Authors
    Yfantidou, Sofia; Karagianni, Christina; Efstathiou, Stefanos; Vakali, Athena; Palotti, Joao; Giakatos, Dimitrios Panteleimon; Marchioro, Thomas; Kazlouski, Andrei; Ferrari, Elena; Girdzijauskas, Šarūnas
    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
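
    A minimal version of the pandas route just described; the file name is a placeholder for whichever daily or hourly CSV you downloaded:

    import pandas as pd

    # "lifesnaps_daily.csv" is a placeholder; substitute the actual CSV file name from the download.
    daily = pd.read_csv("lifesnaps_daily.csv")
    print(daily.shape)
    print(daily.columns.tolist())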

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed (available from the MongoDB website).

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    { _id: <ObjectId>, id (or user_id): <user id>, type: <data type>, data: <embedded object> }

    Each document consists of four fields: _id, id (also found as user_id in the sema and survey collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between different types of data. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
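
    As a sketch of querying this document format with pymongo once the database has been restored (the database and collection names come from the text above; the type string "steps" is only an example and the exact stored value may differ):

    from pymongo import MongoClient

    fitbit = MongoClient("localhost", 27017)["rais_anonymized"]["fitbit"]

    # Count documents of one data type and inspect the embedded data object of the first match.
    print(fitbit.count_documents({"type": "steps"}))
    doc = fitbit.find_one({"type": "steps"})
    if doc:
        print(doc["id"], doc["data"])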

    Surveys Encoding

    BREQ2

    Why do you engage in exercise?

        Code          | Text
        --------------|------------------------------------------------------------------------
        engage[SQ001] | I exercise because other people say I should
        engage[SQ002] | I feel guilty when I don’t exercise
        engage[SQ003] | I value the benefits of exercise
        engage[SQ004] | I exercise because it’s fun
        engage[SQ005] | I don’t see why I should have to exercise
        engage[SQ006] | I take part in exercise because my friends/family/partner say I should
        engage[SQ007] | I feel ashamed when I miss an exercise session
        engage[SQ008] | It’s important to me to exercise regularly
        engage[SQ009] | I can’t see why I should bother exercising
        engage[SQ010] | I enjoy my exercise sessions
        engage[SQ011] | I exercise because others will not be pleased with me if I don’t
        engage[SQ012] | I don’t see the point in exercising
        engage[SQ013] | I feel like a failure when I haven’t exercised in a while
        engage[SQ014] | I think it is important to make the effort to exercise regularly
        engage[SQ015] | I find exercise a pleasurable activity
        engage[SQ016] | I feel under pressure from my friends/family to exercise
        engage[SQ017] | I get restless if I don’t exercise regularly
        engage[SQ018] | I get pleasure and satisfaction from participating in exercise
        engage[SQ019] | I think exercising is a waste of time

    PANAS

    Indicate the extent you have felt this way over the past week

        Code      | Text
        ----------|--------------
        P1[SQ001] | Interested
        P1[SQ002] | Distressed
        P1[SQ003] | Excited
        P1[SQ004] | Upset
        P1[SQ005] | Strong
        P1[SQ006] | Guilty
        P1[SQ007] | Scared
        P1[SQ008] | Hostile
        P1[SQ009] | Enthusiastic
        P1[SQ010] | Proud
        P1[SQ011] | Irritable
        P1[SQ012] | Alert
        P1[SQ013] | Ashamed
        P1[SQ014] | Inspired
        P1[SQ015] | Nervous
        P1[SQ016] | Determined
        P1[SQ017] | Attentive
        P1[SQ018] | Jittery
        P1[SQ019] | Active
        P1[SQ020] | Afraid

    Personality

    How Accurately Can You Describe Yourself?

        Code        | Text
        ------------|--------------------------------------------------------
        ipip[SQ001] | Am the life of the party.
        ipip[SQ002] | Feel little concern for others.
        ipip[SQ003] | Am always prepared.
        ipip[SQ004] | Get stressed out easily.
        ipip[SQ005] | Have a rich vocabulary.
        ipip[SQ006] | Don't talk a lot.
        ipip[SQ007] | Am interested in people.
        ipip[SQ008] | Leave my belongings around.
        ipip[SQ009] | Am relaxed most of the time.
        ipip[SQ010] | Have difficulty understanding abstract ideas.
        ipip[SQ011] | Feel comfortable around people.
        ipip[SQ012] | Insult people.
        ipip[SQ013] | Pay attention to details.
        ipip[SQ014] | Worry about things.
        ipip[SQ015] | Have a vivid imagination.
        ipip[SQ016] | Keep in the background.
        ipip[SQ017] | Sympathize with others' feelings.
        ipip[SQ018] | Make a mess of things.
        ipip[SQ019] | Seldom feel blue.
        ipip[SQ020] | Am not interested in abstract ideas.
        ipip[SQ021] | Start conversations.
        ipip[SQ022] | Am not interested in other people's problems.
        ipip[SQ023] | Get chores done right away.
        ipip[SQ024] | Am easily disturbed.
        ipip[SQ025] | Have excellent ideas.
        ipip[SQ026] | Have little to say.
        ipip[SQ027] | Have a soft heart.
        ipip[SQ028] | Often forget to put things back in their proper place.
        ipip[SQ029] | Get upset easily.
        ipip[SQ030] | Do not have a good imagination.
        ipip[SQ031] | Talk to a lot of different people at parties.
        ipip[SQ032] | Am not really interested in others.
        ipip[SQ033] | Like order.
        ipip[SQ034] | Change my mood a lot.
        ipip[SQ035] | Am quick to understand things.
        ipip[SQ036] | Don't like to draw attention to myself.
        ipip[SQ037] | Take time out for others.
        ipip[SQ038] | Shirk my duties.
        ipip[SQ039] | Have frequent mood swings.
        ipip[SQ040] | Use difficult words.
        ipip[SQ041] | Don't mind being the centre of attention.
        ipip[SQ042] | Feel others' emotions.
        ipip[SQ043] | Follow a schedule.
        ipip[SQ044] | Get irritated easily.
        ipip[SQ045] | Spend time reflecting on things.
        ipip[SQ046] | Am quiet around strangers.
        ipip[SQ047] | Make people feel at ease.
        ipip[SQ048] | Am exacting in my work.
        ipip[SQ049] | Often feel blue.
        ipip[SQ050] | Am full of ideas.

    STAI

    Indicate how you feel right now

        Code        | Text
        ------------|----------------------------------------------------
        STAI[SQ001] | I feel calm
        STAI[SQ002] | I feel secure
        STAI[SQ003] | I am tense
        STAI[SQ004] | I feel strained
        STAI[SQ005] | I feel at ease
        STAI[SQ006] | I feel upset
        STAI[SQ007] | I am presently worrying over possible misfortunes
        STAI[SQ008] | I feel satisfied
        STAI[SQ009] | I feel frightened
        STAI[SQ010] | I feel comfortable
        STAI[SQ011] | I feel self-confident
        STAI[SQ012] | I feel nervous
        STAI[SQ013] | I am jittery
        STAI[SQ014] | I feel indecisive
        STAI[SQ015] | I am relaxed
        STAI[SQ016] | I feel content
        STAI[SQ017] | I am worried
        STAI[SQ018] | I feel confused
        STAI[SQ019] | I feel steady
        STAI[SQ020] | I feel pleasant

    TTM

    Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?

        Code             | Text
        -----------------|-----------------------------------------------
        processes[SQ002] | I read articles to learn more about physical
    
  4. NoSQL Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    + more versions
    Cite
    Dataintelo (2025). NoSQL Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-nosql-software-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    NoSQL Software Market Outlook



    The global NoSQL software market size was valued at approximately USD 6 billion in 2023 and is projected to reach around USD 20 billion by 2032, growing at a compound annual growth rate (CAGR) of 14% during the forecast period. This market is driven by the escalating need for operational efficiency, flexibility, and scalability in database management systems, particularly in enterprises dealing with vast amounts of unstructured data.



    One of the primary growth factors propelling the NoSQL software market is the exponential increase in data volumes generated by various digital platforms, IoT devices, and social media. Traditional relational databases often struggle to handle this surge efficiently, prompting organizations to shift towards NoSQL databases that offer more flexibility and scalability. The ability to store and process large sets of unstructured data without needing a predefined schema makes NoSQL databases an attractive choice for modern businesses seeking agility and speed in data management.



    Moreover, the proliferation of cloud computing services has significantly contributed to the growth of the NoSQL software market. Cloud-based NoSQL databases provide cost-effective, scalable, and easily accessible solutions for enterprises of all sizes. The pay-as-you-go pricing model and the capacity to scale resources based on demand have made NoSQL databases a preferred option for startups and large enterprises alike. The seamless integration of NoSQL databases with cloud infrastructure enhances operational efficiencies and reduces the complexities associated with database management.



    Another critical driver is the increasing adoption of NoSQL databases in various industry verticals such as retail, BFSI, IT, and healthcare. These industries require robust data management solutions to handle large volumes of diverse data types. NoSQL databases, with their flexible data models and high performance, cater to these requirements efficiently. In the retail sector, for example, NoSQL databases are used to manage customer data, product catalogs, and transaction histories, enabling more personalized and efficient customer services.



    Regionally, North America holds a significant share of the NoSQL software market due to the presence of major technology companies and a mature IT infrastructure. The rapid digital transformation across enterprises in the region, alongside substantial investments in big data analytics and cloud computing, further fuels market growth. Additionally, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, driven by the expanding IT sector, increased adoption of cloud services, and significant investments in digital technologies in countries like China and India.



    Graph Databases Software has emerged as a crucial component in the landscape of NoSQL databases, particularly for applications that require understanding complex relationships between data entities. Unlike traditional databases that store data in tables, graph databases use nodes, edges, and properties to represent and store data, making them ideal for scenarios where relationships are as important as the data itself. This approach is particularly beneficial in fields such as social networking, where the ability to analyze connections between users can provide deep insights into social dynamics and influence patterns. As businesses increasingly seek to leverage data for competitive advantage, the demand for graph databases is expected to grow, driven by their ability to efficiently model and query interconnected data.



    Type Analysis



    The NoSQL software market is segmented into various types, including Document-Oriented, Key-Value Store, Column-Oriented, and Graph-Based databases. Document-oriented databases, such as MongoDB, store data in JSON-like documents, offering flexibility in data modeling and ease of use. These databases are widely used for content management systems, e-commerce applications, and real-time analytics. Their ability to handle semi-structured data and scalability features make them a popular choice among developers and enterprises seeking agile database solutions.



    Key-Value Store databases, such as Redis and Amazon DynamoDB, store data as a collection of key-value pairs, providing ultra-fast read and write operations. These databases are ideal for applications requiring high-speed data retrieval, such as caching, session manag

  5. A one percent sample of German Twitter retweet traffic

    • zenodo.org
    Updated Mar 8, 2023
    Cite
    Nane Kratzke (2023). A one percent sample of German Twitter retweet traffic [Dataset]. http://doi.org/10.5281/zenodo.7669923
    Explore at:
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nane Kratzke
    Description

    This dataset includes a one percent sample of German-language Twitter retweets in Twitter raw data format. For each day, all retweets are stored in JSON format (one entry per line).

    The dataset was recorded using Tweepy and exported from a MongoDB database. It is intended to be imported into a MongoDB database to run analytical queries. It is not intended to be processed as is.

    The dataset covers 60 consecutive days and ends on 01/25/2023.
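
    Since the files hold one JSON entry per line and are meant to be loaded into MongoDB, a minimal import sketch with pymongo could look like this (the file, database, and collection names are placeholders; mongoimport would work equally well):

    import json
    from pymongo import MongoClient

    # Placeholder names; adjust to your local setup and to the daily file you want to load.
    col = MongoClient("localhost", 27017)["retweets"]["german_sample"]

    with open("retweets_day_01.json", encoding="utf-8") as fh:
        batch = [json.loads(line) for line in fh if line.strip()]

    col.insert_many(batch)
    print(col.estimated_document_count())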

    The dataset was recorded as part of this study.

    Kratzke, N. How to Find Orchestrated Trolls? A Case Study on Identifying Polarized Twitter Echo Chambers. Computers 2023, 12, 57. https://doi.org/10.3390/computers12030057

  6. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jul 12, 2022
    Cite
    Zenodo (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6832242?locale=fr
    Explore at:
    Available download formats: unknown (642961582)
    Dataset updated
    Jul 12, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database. To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed (available from the MongoDB website). For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit

  7. Job Postings from Ireland (October 2021)

    • kaggle.com
    zip
    Updated Apr 16, 2023
    Cite
    Techmap.io (2023). Job Postings from Ireland (October 2021) [Dataset]. https://www.kaggle.com/datasets/techmap/job-postings-ireland-october-2021
    Explore at:
    Available download formats: zip (56469415 bytes)
    Dataset updated
    Apr 16, 2023
    Authors
    Techmap.io
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Ireland
    Description

    Context

    This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 24621 Job Postings from Ireland in October 2021.

    Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.

    More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.

    We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default, mongoexport writes one JSON document per line, one for every MongoDB document.
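
    Because mongoexport emits one JSON document per line, the export can be read directly with pandas; the file name below is a placeholder, and the column names are taken from the Available Fields table further down:

    import pandas as pd

    # Placeholder file name for the JSON export contained in the download.
    jobs = pd.read_json("ireland_jobs_2021-10.json", lines=True)
    print(len(jobs))
    print(jobs[["name", "dateCreated"]].head())  # field names from the Available Fields table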

    Inspiration

    This dataset was created to help data scientists and researchers across the world.

    License

    This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)

    Available Fields

    (as generated by variety.js)

    +----------------------------------------------------
    | key           | types   | Explanation
    | ------------------------| ----------| -------------
    | _id           | ObjectId | Unique ID from the MongoDB
    | companyID        | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
    | contact         | Object  | Map/Object with contact info from the JSON, HTML or extracted from job posting
    | contact.email      | String  | Corporate email address mentioned in the JSON or job posting
    | contact.phone      | String  | Corporate phone number extracted from the JSON or job posting
    | dateCreated       | Date   | Date the job posting was created (or date scraped if creation date is not available)
    | dateExpired       | Date   | Date the job posting expires
    | dateScraped       | Date   | Date the job posting was scraped
    | html          | String  | The raw HTML of the job description (can be plain text for some sources)
    | idInSource       | String  | An id used in the source portal (unique for the source)
    | json          | Object  | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting and pageData1-3 source-specific json)
    | locale         | String  | Locale extracted from the JSON or job posting (e.g., "en_US")
    | locationID       | ObjectId | ID to a location document in our MongoDB (unique for company but not unique for jobs)
    | name          | String  | Title or Name of the job posting
    | orgAddress       | Object  | Original address data extracted from the job posting
    | orgAddress.addressLine | String  | Raw address line - mostly just a city name
    | orgAddress.city     | String  | City name from JSON, HTML or extracted from addressLine
    | orgAddress.companyName | String  | Company name from JSON, HTML or extracted from addressLine
    | orgAddress.country   | String  | Country name from JSON, HTML or extracted from addressLine
    | orgAddress.countryCode | String  | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
    | orgAddress.county    | String  | County name from JSON, HTML or extracted from addressLine
    | orgAddress.district   | String  | (City) District name from JSON, HTML or extracted from addressLine
    | orgAddress.formatted  | String  | Formatted address data extracted from the job posting
    | orgAddress.geoPoint   | Object  | Map of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lat | Number  | Latitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lng | Number  | Longitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.houseNumber | String  | House number extracted from the street or from JSON, HTML or extracted from addressLine
    | orgAddress.level    | Number  | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
    | orgAddress.postCode   | String  | Postal code / zip code extracted from JSON, HTML or addressLine
    | orgAddress.quarter   | String  | (City) Quarter name from JSON, HTML or extracted from addressLine
    | orgAddress.state    | String  | State name or abbreviation from JSON, HTML or extracted from addressLine
    | orgAddress.street    | String  | Street name (and maybe housen...
    
  8. Screenshots and metadata for 214 reCAPTCHA challenges encountered between...

    • data-staging.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Jun 19, 2024
    Cite
    Ben Pettis (2024). Screenshots and metadata for 214 reCAPTCHA challenges encountered between September 2022 - September 2023 [Dataset]. http://doi.org/10.5061/dryad.h70rxwdsr
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    University of Wisconsin–Madison
    Authors
    Ben Pettis
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    In Chapter 3 of my dissertation (tentatively titled "Becoming Users: Layers of People, Technology, and Power on the Internet"), I describe how online user activities are datafied and monetized in subtle and often obfuscated ways. The chapter focuses on Google’s reCAPTCHA, a popular implementation of a CAPTCHA challenge. A CAPTCHA, or “Completely Automated Public Turing test to tell Computers and Humans Apart,” is a simple task or challenge intended to differentiate between genuine human users and those who may be using software or other automated means to interact maliciously with a website, such as for spam, mass data scraping, or denial of service attacks. reCAPTCHA challenges are increasingly being hidden from direct view of the user, instead assessing our mouse movements, browsing patterns, and other data to evaluate the likelihood that we are “authentic” users. These hidden challenges raise the stakes of understanding our own construction as Users because they obfuscate practices of surveillance and the ways that our activities as users are commodified by large corporations (Pettis, 2023). By studying the specifics of how such data collection works—that is, how we’re called upon and situated as Users—we can make more informed decisions about how we engage with the contemporary internet.

    This data set contains metadata for the 214 reCAPTCHA elements that I encountered during my personal use of the Web for the period of one year (September 2022 through September 2023). Of these reCAPTCHAs, 137 were visible challenges, meaning that there was some indication of the presence of a reCAPTCHA challenge. The remaining 77 reCAPTCHAs were entirely hidden on the page; if I had not been running my browser extension, I would likely never have been aware of the use of a reCAPTCHA on the page. The data set also includes screenshots for 174 of the reCAPTCHAs. Screenshots that contain sensitive or private information have been excluded from public access. Researchers can request access to these additional files by contacting Ben Pettis (bpettis@wisc.edu). A browsable and searchable version of the data is also available at https://capturingcaptcha.com.

    Methods

    I developed a custom Google Chrome extension which detects when a page contains a reCAPTCHA and prompts the user to save a screenshot or screen recording while also collecting basic metadata. During Summer 2022, I began work on this website to collate and present the screen captures that I save throughout the year. The purpose of collecting these examples of websites where reCAPTCHAs appear is to understand how this Web element is situated within websites and presented to users, along with sketching out the frequency of their use and on what kinds of websites. Given that I will only be collecting records of my own interactions with reCAPTCHAs, this will not be a comprehensive sample that I can generalize as representative of all Web users. Though my experiences of the reCAPTCHA will differ from those of any other person, this collection will nevertheless be useful for demonstrating how the interface element may be embedded within websites and presented to users. Following Niels Brügger’s descriptions of Web history methods, these screen capture techniques provide an effective way to preserve a portion of the Web as it was actually encountered by a person, as opposed to methods such as automated scraping.
    Therefore my dissertation offers a methodological contribution to Web historians by demonstrating a technique for identifying and preserving a representation of one Web element within a page, as opposed to focusing an analysis on a whole page or entire website.

    The browser extension is configured to store data in a cloud-based document database running in MongoDB Atlas. Any screenshots or video recordings are uploaded to a Google Cloud Storage bucket. Both the database and cloud storage bucket are private and are restricted from direct access. The data and screenshots are viewable and searchable at https://capturingcaptcha.com. This data set represents an export of the database as of June 10, 2024. After this date, it is possible that data collection will be resumed, causing more information to be displayed in the online website. The data was exported from the database to a single JSON file (lines format) using the mongoexport command line tool:

    mongoexport --uri mongodb+srv://[database-url].mongodb.net/production --collection submissions --out captcha-out.json --username [databaseuser]
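
    Since the export is a single JSON file in lines format, the metadata can be inspected with nothing more than the Python standard library; the file name matches the --out value in the command above:

    import json

    # One JSON object per line, as produced by the mongoexport command above.
    with open("captcha-out.json", encoding="utf-8") as fh:
        submissions = [json.loads(line) for line in fh if line.strip()]

    print(len(submissions))
    print(sorted(submissions[0].keys()))  # inspect the metadata fields of one record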

  9. US Job Postings from 2023-05-05

    • kaggle.com
    zip
    Updated May 10, 2023
    Cite
    Techmap.io (2023). US Job Postings from 2023-05-05 [Dataset]. https://www.kaggle.com/datasets/techmap/us-job-postings-from-2023-05-05/discussion?sort=undefined
    Explore at:
    Available download formats: zip (805159819 bytes)
    Dataset updated
    May 10, 2023
    Authors
    Techmap.io
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Context

    This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 33k Job Postings from the USA on May 5th 2023.

    Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.

    More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.

    We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default, mongoexport writes one JSON document per line, one for every MongoDB document.
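
    As with the other Techmap exports, the file holds one JSON document per line; a small sketch that tallies postings by state (placeholder file name, field path taken from the Available Fields table below):

    import json
    from collections import Counter

    states = Counter()
    # Placeholder file name for the JSON export contained in the download.
    with open("us_jobs_2023-05-05.json", encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            doc = json.loads(line)
            states[(doc.get("orgAddress") or {}).get("state")] += 1

    print(states.most_common(10))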

    Inspiration

    This dataset was created to help data scientists and researchers across the world.

    License

    This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)

    Content

    Total Records Count: 33064
    Sources: 29 job boards (174 with country-portals) such as CareerBuilder, EURES, Monster, or LinkedIn
    Date Range: 5 May 2023 - 5 May 2023
    File Extension: JSON

    Available Fields

    (as generated by variety.js)

    +----------------------------------------------------
    | key           | types   | Explanation
    | ------------------------| ----------| -------------
    | _id           | ObjectId | Unique ID from the MongoDB
    | companyID        | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
    | contact         | Object  | Map/Object with contact info from the JSON, HTML or extracted from job posting
    | contact.email      | String  | Corporate email address mentioned in the JSON or job posting
    | contact.phone      | String  | Corporate phone number extracted from the JSON or job posting
    | dateCreated       | Date   | Date the job posting was created (or date scraped if creation date is not available)
    | dateExpired       | Date   | Date the job posting expires
    | dateScraped       | Date   | Date the job posting was scraped
    | html          | String  | The raw HTML of the job description (can be plain text for some sources)
    | idInSource       | String  | An id used in the source portal (unique for the source)
    | json          | Object  | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting and pageData1-3 source-specific json)
    | locale         | String  | Locale extracted from the JSON or job posting (e.g., "en_US")
    | locationID       | ObjectId | ID to a location document in our MongoDB (unique for company but not unique for jobs)
    | name          | String  | Title or Name of the job posting
    | orgAddress       | Object  | Original address data extracted from the job posting
    | orgAddress.addressLine | String  | Raw address line - mostly just a city name
    | orgAddress.city     | String  | City name from JSON, HTML or extracted from addressLine
    | orgAddress.companyName | String  | Company name from JSON, HTML or extracted from addressLine
    | orgAddress.country   | String  | Country name from JSON, HTML or extracted from addressLine
    | orgAddress.countryCode | String  | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
    | orgAddress.county    | String  | County name from JSON, HTML or extracted from addressLine
    | orgAddress.district   | String  | (City) District name from JSON, HTML or extracted from addressLine
    | orgAddress.formatted  | String  | Formatted address data extracted from the job posting
    | orgAddress.geoPoint   | Object  | Map of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lat | Number  | Latitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.geoPoint.lng | Number  | Longitude of geo coordinate if stated in the JSON or job posting
    | orgAddress.houseNumber | String  | House number extracted from the street or from JSON, HTML or extracted from addressLine
    | orgAddress.level    | Number  | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
    | orgAddress.postCode   | String  | Postal code / zip code extracted from JSON, HTML or addressLine
    | orgAddress.quarter   | String  | (City) Quarter name from JSON, HTML or extracted fro...
    
  10. 785 Million Language Translation Database for AI

    • kaggle.com
    zip
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Available download formats: zip (6504894854 bytes)
    Dataset updated
    Aug 28, 2023
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB (uncompressed), 20 GB (compressed)

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each record contains an English word and its equivalent translation. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique and sorted.
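
    The exact record layout is not documented here, so the following reader is only a sketch: it assumes one JSON object per line (typical for a mongoexport) and uses purely hypothetical field names that would need to be adjusted to the actual keys in the files:

    import json

    # Field names "en", "lang", and "translation" are hypothetical placeholders.
    with open("translations_sample.json", encoding="utf-8") as fh:  # placeholder file name
        for i, line in enumerate(fh):
            record = json.loads(line)
            print(record.get("en"), "->", record.get("lang"), record.get("translation"))
            if i >= 4:
                break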

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  11. Inventory data for Pharmacy Website in JSON format

    • kaggle.com
    zip
    Updated Oct 22, 2024
    Cite
    Priti Poddar (2024). Inventory data for Pharmacy Website in JSON format [Dataset]. https://www.kaggle.com/datasets/pritipoddar/inventory-data-for-pharmacy-website-in-json-format
    Explore at:
    Available download formats: zip (14761 bytes)
    Dataset updated
    Oct 22, 2024
    Authors
    Priti Poddar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset contains inventory data for a pharmacy e-commerce website in JSON format, designed for easy integration into MongoDB databases, making it ideal for MERN stack projects. It includes 10 fields:

    • drugName: Name of the drug
    • manufacturer: Drug manufacturer
    • image: URL of the product image
    • description: Detailed description of the drug
    • expiryDate: Expiry date of the drug
    • price: Price of the drug
    • sideEffects: Potential side effects
    • disclaimer: Important legal and medical disclaimers
    • category: Drug classification (e.g., pain relief, antibiotics)
    • countInStock: Quantity of the product available in stock

    This dataset is useful for developing pharmacy-related web applications, inventory management systems, or online medical stores using the MERN stack.
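
    A minimal sketch of loading the file into a local MongoDB collection for a MERN-style project; the database, collection, and file names are placeholders, while the field names come from the list above:

    import json
    from pymongo import MongoClient

    col = MongoClient("localhost", 27017)["pharmacy"]["inventory"]  # placeholder names

    # Assumes the JSON file is a single array of product documents; adjust if it is one document per line.
    with open("inventory.json", encoding="utf-8") as fh:
        items = json.load(fh)

    col.insert_many(items)
    print(col.find_one({}, {"drugName": 1, "price": 1, "countInStock": 1}))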

    Do not use for production-level purposes; use for project development only. Feel free to contribute if you find any mistakes or have suggestions.
