Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This mongodump contains four collections associated with http://dx.doi.org/10.1021/acscentsci.7b00064:

reaction_examples/lowe_1976-2013_USPTOgrants - a collection of reaction SMILES extracted from USPTO grants by Daniel Lowe
reaction_examples/lowe_1976-2013_USPTOgrants_reactions - an incomplete collection of reactions extracted from USPTO grants by Daniel Lowe, containing some additional information about reagents/catalysts/solvents where known
askcos_transforms/lowe_refs_general_v3 - a collection of highly general reaction SMARTS strings extracted from the USPTO
smilesprediction/candidate_edits_8_9_16 - a collection of reaction examples with possible products enumerated, used as input for a machine learning model
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
sample_mflix.embedded_movies
This data set contains details on movies with genres of Western, Action, or Fantasy. Each document contains a single movie, and information such as its title, release year, and cast. In addition, documents in this collection include a plot_embedding field that contains embeddings created using OpenAI's text-embedding-ada-002 embedding model that you can use with the Atlas Search vector search feature.
Overview
This dataset offers a… See the full description on the dataset page: https://huggingface.co/datasets/MongoDB/embedded_movies.
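As a sketch of how the plot_embedding field could be queried: the pipeline below targets the Atlas Vector Search $vectorSearch aggregation stage. The index name "vector_index", the projection, and the candidate-pool multiplier are assumptions rather than part of the dataset's documentation, and actually running it requires an Atlas cluster with a vector index on plot_embedding.

```python
# Sketch of an Atlas Vector Search aggregation over the plot_embedding
# field. "vector_index" and the candidate counts are assumed values.

def build_vector_search_pipeline(query_embedding, limit=5):
    """Return a $vectorSearch aggregation pipeline for the given vector."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",         # assumed index name
                "path": "plot_embedding",        # field documented above
                "queryVector": query_embedding,  # 1536-dim ada-002 vector
                "numCandidates": 20 * limit,     # coarse candidate pool
                "limit": limit,
            }
        },
        # Keep only a few fields, plus the similarity score.
        {"$project": {"title": 1, "plot": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

pipeline = build_vector_search_pipeline([0.0] * 1536)
# With pymongo, this would be executed as:
#   db.embedded_movies.aggregate(pipeline)
```

The pipeline itself is plain data, so it can be constructed and inspected without a live connection.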
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health on the one hand and behavioral patterns and psychological measurements on the other, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically distributed dataset containing a plethora of anthropological data, collected unobtrusively over a total course of more than 4 months by n=71 participants under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types, from second-level to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data openly available to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
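For illustration, here is a minimal pandas sketch. The column names and values below are invented stand-ins rather than the actual LifeSnaps schema; in practice the in-memory buffer would be replaced with a path to one of the provided CSV files.

```python
import io
import pandas as pd

# Illustrative only: these columns are made up, not the LifeSnaps schema.
# With a real file, replace the buffer with its path, e.g.
#   df = pd.read_csv("path/to/daily_granularity.csv")
csv_data = io.StringIO(
    "id,date,steps\n"
    "user_a,2021-06-01,8412\n"
    "user_a,2021-06-02,10231\n"
)
df = pd.read_csv(csv_data, parse_dates=["date"])

print(df.shape)            # (2, 3)
print(df["steps"].mean())  # 9321.5
```

From there, the usual pandas operations (grouping by user, resampling by date, merging survey and Fitbit frames) apply directly.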
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend using the raw, complete data by importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:

{
  _id: <MongoDB primary key>,
  id (or user_id): <user-specific ID>,
  type: <data type>,
  data: <embedded object>
}

Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count at a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between data types. As mentioned previously, all times are stored in local time, and user IDs are common across different collections. For more information on the available data types, see the related publication.
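To make the format concrete, here is a minimal plain-Python sketch of filtering such documents by type. The sample values, and the dateTime/value field names inside data, are invented for illustration; with pymongo, the equivalent server-side query would be db.fitbit.find({"type": "steps"}).

```python
# Minimal sketch of the document format described above. Sample values
# and the fields inside "data" are invented; real documents come from
# the fitbit, sema, and surveys collections.
docs = [
    {"_id": 1, "id": "user_a", "type": "steps",
     "data": {"dateTime": "2021-06-01 10:00:00", "value": 512}},
    {"_id": 2, "id": "user_a", "type": "heart_rate",
     "data": {"dateTime": "2021-06-01 10:00:00", "value": 71}},
    {"_id": 3, "id": "user_b", "type": "steps",
     "data": {"dateTime": "2021-06-01 10:00:00", "value": 230}},
]

# Select all step documents, as db.fitbit.find({"type": "steps"}) would.
steps = [d for d in docs if d["type"] == "steps"]

# The contents of "data" are type-dependent; for steps it carries a count.
total = sum(d["data"]["value"] for d in steps)
print(total)  # 742
```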
Surveys Encoding
BREQ2
Why do you engage in exercise?
| Code          | Text
| ------------- | -------------
| engage[SQ001] | I exercise because other people say I should
| engage[SQ002] | I feel guilty when I don’t exercise
| engage[SQ003] | I value the benefits of exercise
| engage[SQ004] | I exercise because it’s fun
| engage[SQ005] | I don’t see why I should have to exercise
| engage[SQ006] | I take part in exercise because my friends/family/partner say I should
| engage[SQ007] | I feel ashamed when I miss an exercise session
| engage[SQ008] | It’s important to me to exercise regularly
| engage[SQ009] | I can’t see why I should bother exercising
| engage[SQ010] | I enjoy my exercise sessions
| engage[SQ011] | I exercise because others will not be pleased with me if I don’t
| engage[SQ012] | I don’t see the point in exercising
| engage[SQ013] | I feel like a failure when I haven’t exercised in a while
| engage[SQ014] | I think it is important to make the effort to exercise regularly
| engage[SQ015] | I find exercise a pleasurable activity
| engage[SQ016] | I feel under pressure from my friends/family to exercise
| engage[SQ017] | I get restless if I don’t exercise regularly
| engage[SQ018] | I get pleasure and satisfaction from participating in exercise
| engage[SQ019] | I think exercising is a waste of time
PANAS
Indicate the extent you have felt this way over the past week
| Code      | Text
| --------- | -------------
| P1[SQ001] | Interested
| P1[SQ002] | Distressed
| P1[SQ003] | Excited
| P1[SQ004] | Upset
| P1[SQ005] | Strong
| P1[SQ006] | Guilty
| P1[SQ007] | Scared
| P1[SQ008] | Hostile
| P1[SQ009] | Enthusiastic
| P1[SQ010] | Proud
| P1[SQ011] | Irritable
| P1[SQ012] | Alert
| P1[SQ013] | Ashamed
| P1[SQ014] | Inspired
| P1[SQ015] | Nervous
| P1[SQ016] | Determined
| P1[SQ017] | Attentive
| P1[SQ018] | Jittery
| P1[SQ019] | Active
| P1[SQ020] | Afraid
Personality
How Accurately Can You Describe Yourself?
| Code        | Text
| ----------- | -------------
| ipip[SQ001] | Am the life of the party.
| ipip[SQ002] | Feel little concern for others.
| ipip[SQ003] | Am always prepared.
| ipip[SQ004] | Get stressed out easily.
| ipip[SQ005] | Have a rich vocabulary.
| ipip[SQ006] | Don't talk a lot.
| ipip[SQ007] | Am interested in people.
| ipip[SQ008] | Leave my belongings around.
| ipip[SQ009] | Am relaxed most of the time.
| ipip[SQ010] | Have difficulty understanding abstract ideas.
| ipip[SQ011] | Feel comfortable around people.
| ipip[SQ012] | Insult people.
| ipip[SQ013] | Pay attention to details.
| ipip[SQ014] | Worry about things.
| ipip[SQ015] | Have a vivid imagination.
| ipip[SQ016] | Keep in the background.
| ipip[SQ017] | Sympathize with others' feelings.
| ipip[SQ018] | Make a mess of things.
| ipip[SQ019] | Seldom feel blue.
| ipip[SQ020] | Am not interested in abstract ideas.
| ipip[SQ021] | Start conversations.
| ipip[SQ022] | Am not interested in other people's problems.
| ipip[SQ023] | Get chores done right away.
| ipip[SQ024] | Am easily disturbed.
| ipip[SQ025] | Have excellent ideas.
| ipip[SQ026] | Have little to say.
| ipip[SQ027] | Have a soft heart.
| ipip[SQ028] | Often forget to put things back in their proper place.
| ipip[SQ029] | Get upset easily.
| ipip[SQ030] | Do not have a good imagination.
| ipip[SQ031] | Talk to a lot of different people at parties.
| ipip[SQ032] | Am not really interested in others.
| ipip[SQ033] | Like order.
| ipip[SQ034] | Change my mood a lot.
| ipip[SQ035] | Am quick to understand things.
| ipip[SQ036] | Don't like to draw attention to myself.
| ipip[SQ037] | Take time out for others.
| ipip[SQ038] | Shirk my duties.
| ipip[SQ039] | Have frequent mood swings.
| ipip[SQ040] | Use difficult words.
| ipip[SQ041] | Don't mind being the centre of attention.
| ipip[SQ042] | Feel others' emotions.
| ipip[SQ043] | Follow a schedule.
| ipip[SQ044] | Get irritated easily.
| ipip[SQ045] | Spend time reflecting on things.
| ipip[SQ046] | Am quiet around strangers.
| ipip[SQ047] | Make people feel at ease.
| ipip[SQ048] | Am exacting in my work.
| ipip[SQ049] | Often feel blue.
| ipip[SQ050] | Am full of ideas.
STAI
Indicate how you feel right now
| Code        | Text
| ----------- | -------------
| STAI[SQ001] | I feel calm
| STAI[SQ002] | I feel secure
| STAI[SQ003] | I am tense
| STAI[SQ004] | I feel strained
| STAI[SQ005] | I feel at ease
| STAI[SQ006] | I feel upset
| STAI[SQ007] | I am presently worrying over possible misfortunes
| STAI[SQ008] | I feel satisfied
| STAI[SQ009] | I feel frightened
| STAI[SQ010] | I feel comfortable
| STAI[SQ011] | I feel self-confident
| STAI[SQ012] | I feel nervous
| STAI[SQ013] | I am jittery
| STAI[SQ014] | I feel indecisive
| STAI[SQ015] | I am relaxed
| STAI[SQ016] | I feel content
| STAI[SQ017] | I am worried
| STAI[SQ018] | I feel confused
| STAI[SQ019] | I feel steady
| STAI[SQ020] | I feel pleasant
TTM
Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?
| Code             | Text
| ---------------- | -------------
| processes[SQ002] | I read articles to learn more about physical activity
https://dataintelo.com/privacy-and-policy
The global NoSQL software market size was valued at approximately USD 6 billion in 2023 and is projected to reach around USD 20 billion by 2032, growing at a compound annual growth rate (CAGR) of 14% during the forecast period. This market is driven by the escalating need for operational efficiency, flexibility, and scalability in database management systems, particularly in enterprises dealing with vast amounts of unstructured data.
One of the primary growth factors propelling the NoSQL software market is the exponential increase in data volumes generated by various digital platforms, IoT devices, and social media. Traditional relational databases often struggle to handle this surge efficiently, prompting organizations to shift towards NoSQL databases that offer more flexibility and scalability. The ability to store and process large sets of unstructured data without needing a predefined schema makes NoSQL databases an attractive choice for modern businesses seeking agility and speed in data management.
Moreover, the proliferation of cloud computing services has significantly contributed to the growth of the NoSQL software market. Cloud-based NoSQL databases provide cost-effective, scalable, and easily accessible solutions for enterprises of all sizes. The pay-as-you-go pricing model and the capacity to scale resources based on demand have made NoSQL databases a preferred option for startups and large enterprises alike. The seamless integration of NoSQL databases with cloud infrastructure enhances operational efficiencies and reduces the complexities associated with database management.
Another critical driver is the increasing adoption of NoSQL databases in various industry verticals such as retail, BFSI, IT, and healthcare. These industries require robust data management solutions to handle large volumes of diverse data types. NoSQL databases, with their flexible data models and high performance, cater to these requirements efficiently. In the retail sector, for example, NoSQL databases are used to manage customer data, product catalogs, and transaction histories, enabling more personalized and efficient customer services.
Regionally, North America holds a significant share of the NoSQL software market due to the presence of major technology companies and a mature IT infrastructure. The rapid digital transformation across enterprises in the region, alongside substantial investments in big data analytics and cloud computing, further fuels market growth. Additionally, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, driven by the expanding IT sector, increased adoption of cloud services, and significant investments in digital technologies in countries like China and India.
Graph Databases Software has emerged as a crucial component in the landscape of NoSQL databases, particularly for applications that require understanding complex relationships between data entities. Unlike traditional databases that store data in tables, graph databases use nodes, edges, and properties to represent and store data, making them ideal for scenarios where relationships are as important as the data itself. This approach is particularly beneficial in fields such as social networking, where the ability to analyze connections between users can provide deep insights into social dynamics and influence patterns. As businesses increasingly seek to leverage data for competitive advantage, the demand for graph databases is expected to grow, driven by their ability to efficiently model and query interconnected data.
The NoSQL software market is segmented into various types, including Document-Oriented, Key-Value Store, Column-Oriented, and Graph-Based databases. Document-oriented databases, such as MongoDB, store data in JSON-like documents, offering flexibility in data modeling and ease of use. These databases are widely used for content management systems, e-commerce applications, and real-time analytics. Their ability to handle semi-structured data and scalability features make them a popular choice among developers and enterprises seeking agile database solutions.
Key-Value Store databases, such as Redis and Amazon DynamoDB, store data as a collection of key-value pairs, providing ultra-fast read and write operations. These databases are ideal for applications requiring high-speed data retrieval, such as caching and session management.
This dataset includes a one percent sample of German-language Twitter retweets in Twitter raw data format. For each day, all retweets are stored in JSON data format (one entry per line).
The dataset was recorded using Tweepy and exported from a MongoDB database. It is intended to be imported into a MongoDB database to run analytical queries. It is not intended to be processed as is.
The dataset covers 60 consecutive days and ends on 01/25/2023.
The dataset was recorded as part of the following study:
Kratzke, N. How to Find Orchestrated Trolls? A Case Study on Identifying Polarized Twitter Echo Chambers. Computers 2023, 12, 57. https://doi.org/10.3390/computers12030057
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 24621 Job Postings from Ireland in October 2021.
Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.
More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.
We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default mongoexport writes data using one JSON document for every MongoDB document.
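Because each line of the export is a standalone JSON document, the file can be parsed line by line with the standard json module. The sketch below uses an in-memory sample with two of the documented fields; the filename in the comment is illustrative, not the actual export name.

```python
import json

def read_jsonl(lines):
    """Parse a mongoexport-style export: one JSON document per line."""
    return [json.loads(line) for line in lines if line.strip()]

# In-memory sample standing in for the export file; with a real file use:
#   with open("techmap-jobs-export.json") as f:
#       docs = read_jsonl(f)
# (the filename is illustrative).
sample = [
    '{"name": "Backend Engineer", "locale": "en_IE"}',
    '{"name": "Data Analyst", "locale": "en_IE"}',
]
docs = read_jsonl(sample)
print(len(docs))        # 2
print(docs[0]["name"])  # Backend Engineer
```

Streaming line by line this way avoids loading the whole export into memory at once.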
This dataset was created to help data scientists and researchers across the world.
This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)
(as generated by variety.js)
+----------------------------------------------------
| key | types | Explanation
| ------------------------| ----------| -------------
| _id | ObjectId | Unique ID from the MongoDB
| companyID | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
| contact | Object | Map/Object with contact info from the JSON, HTML or extracted from job posting
| contact.email | String | Corporate email address mentioned from JSON or job posting
| contact.phone | String | Corporate phone number extracted from JSON or job posting
| dateCreated | Date | Date the job posting was created (or date scraped if creation date is not available)
| dateExpired | Date | Date the job posting expires
| dateScraped | Date | Date the job posting was scraped
| html | String | The raw HTML of the job description (can be plain text for some sources)
| idInSource | String | An id used in the source portal (unique for the source)
| json | Object | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting; pageData1-3 contain source-specific JSON)
| locale | String | Locale extracted from the JSON or job posting (e.g., "en_US")
| locationID | ObjectId | ID to a location document in our MongoDB (unique for location but not unique for jobs)
| name | String | Title or Name of the job posting
| orgAddress | Object | Original address data extracted from the job posting
| orgAddress.addressLine | String | Raw address line - mostly just a city name
| orgAddress.city | String | City name from JSON, HTML or extracted from addressLine
| orgAddress.companyName | String | Company name from JSON, HTML or extracted from addressLine
| orgAddress.country | String | Country name from JSON, HTML or extracted from addressLine
| orgAddress.countryCode | String | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
| orgAddress.county | String | County name from JSON, HTML or extracted from addressLine
| orgAddress.district | String | (City) District name from JSON, HTML or extracted from addressLine
| orgAddress.formatted | String | Formatted address data extracted from the job posting
| orgAddress.geoPoint | Object | Map of geo coordinate if stated in the JSON or job posting
| orgAddress.geoPoint.lat | Number | Latitude of geo coordinate if stated in the JSON or job posting
| orgAddress.geoPoint.lng | Number | Longitude of geo coordinate if stated in the JSON or job posting
| orgAddress.houseNumber | String | House number from JSON, HTML, or extracted from the street field or addressLine
| orgAddress.level | Number | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
| orgAddress.postCode | String | Postal code / zip code extracted from JSON, HTML or addressLine
| orgAddress.quarter | String | (City) Quarter name from JSON, HTML or extracted from addressLine
| orgAddress.state | String | State name or abbreviation from JSON, HTML or extracted from addressLine
| orgAddress.street | String | Street name (and maybe housen...
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
In Chapter 3 of my dissertation (tentatively titled "Becoming Users: Layers of People, Technology, and Power on the Internet"), I describe how online user activities are datafied and monetized in subtle and often obfuscated ways. The chapter focuses on Google’s reCAPTCHA, a popular implementation of a CAPTCHA challenge. A CAPTCHA, or "Completely Automated Public Turing test to tell Computers and Humans Apart," is a simple task or challenge intended to differentiate between genuine human users and those who may be using software or other automated means to interact maliciously with a website, such as for spam, mass data scraping, or denial-of-service attacks.

reCAPTCHA challenges are increasingly being hidden from the direct view of the user, instead assessing our mouse movements, browsing patterns, and other data to evaluate the likelihood that we are "authentic" users. These hidden challenges raise the stakes of understanding our own construction as Users because they obfuscate practices of surveillance and the ways that our activities as users are commodified by large corporations (Pettis, 2023). By studying the specifics of how such data collection works (that is, how we’re called upon and situated as Users), we can make more informed decisions about how we engage with the contemporary internet.

This data set contains metadata for the 214 reCAPTCHA elements that I encountered during my personal use of the Web over one year (September 2022 through September 2023). Of these reCAPTCHAs, 137 were visible challenges, meaning that there was some indication of the presence of a reCAPTCHA challenge. The remaining 77 reCAPTCHAs were entirely hidden on the page; had I not been running my browser extension, I would likely never have been aware of the use of a reCAPTCHA on those pages. The data set also includes screenshots for 174 of the reCAPTCHAs. Screenshots that contain sensitive or private information have been excluded from public access.
Researchers can request access to these additional files by contacting Ben Pettis (bpettis@wisc.edu). A browsable and searchable version of the data is also available at https://capturingcaptcha.com.

Methods

I developed a custom Google Chrome extension which detects when a page contains a reCAPTCHA and prompts the user to save a screenshot or screen recording while also collecting basic metadata. During Summer 2022, I began work on this website to collate and present the screen captures that I save throughout the year. The purpose of collecting these examples of websites where reCAPTCHAs appear is to understand how this Web element is situated within websites and presented to users, along with sketching out the frequency of their use and on what kinds of websites.

Given that I will only be collecting records of my own interactions with reCAPTCHAs, this will not be a comprehensive sample that I can generalize as representative of all Web users. Though my experiences of the reCAPTCHA will differ from those of any other person, this collection will nevertheless be useful for demonstrating how the interface element may be embedded within websites and presented to users. Following Niels Brügger’s descriptions of Web history methods, these screen capture techniques provide an effective way to preserve a portion of the Web as it was actually encountered by a person, as opposed to methods such as automated scraping. Therefore my dissertation offers a methodological contribution to Web historians by demonstrating a technique for identifying and preserving a representation of one Web element within a page, as opposed to focusing an analysis on a whole page or entire website.

The browser extension is configured to store data in a cloud-based document database running in MongoDB Atlas. Any screenshots or video recordings are uploaded to a Google Cloud Storage bucket. Both the database and cloud storage bucket are private and are restricted from direct access.
The data and screenshots are viewable and searchable at https://capturingcaptcha.com. This data set represents an export of the database as of June 10, 2024. After this date, it is possible that data collection will be resumed, causing more information to be displayed on the website. The data was exported from the database to a single JSON file (lines format) using the mongoexport command line tool:

mongoexport --uri mongodb+srv://[database-url].mongodb.net/production --collection submissions --out captcha-out.json --username [databaseuser]
https://www.cognitivemarketresearch.com/privacy-policy
The global Database Management Systems market was valued at USD 50.5 billion in 2022 and is projected to reach USD 120.6 billion by 2030, registering a CAGR of 11.5% for the forecast period 2023-2030.

Factors Affecting Database Management Systems Market Growth

The growing inclination of organizations towards adopting advanced technologies such as cloud computing favours the growth of the global DBMS market
Cloud-based database management system solutions give organizations the ability to scale their database infrastructure up or down as required. In a demanding business environment, data volume can vary over time; the cloud allows organizations to allocate resources dynamically and systematically, ensuring optimal performance without underutilization. These solutions are also cost-efficient: they eliminate the need for companies to maintain and invest in physical infrastructure and hardware, reducing both ongoing operational costs and upfront capital expenditures. Organizations can choose pay-as-you-go pricing models, paying only for the resources they consume, which makes the cloud a cost-efficient option for smaller businesses and large enterprises alike. Moreover, cloud-based DBMS platforms usually come with management tools that streamline administrative tasks such as backup, provisioning, recovery, and monitoring, allowing IT teams to concentrate on strategic tasks rather than routine maintenance and thereby enhancing operational efficiency. Cloud-based database management systems also enable remote access and collaboration among teams irrespective of their physical locations, which matters in today's distributed and remote work environments: authorized personnel can access and update data in real time, supporting collaboration and better decision-making. Owing to all of these factors, the rising adoption of advanced technologies like cloud-based DBMS is favouring market growth.
Availability of open-source solutions is likely to restrain the global database management systems market growth

Open-source database management system solutions such as PostgreSQL, MongoDB, and MySQL offer strong functionality at minimal or no licensing cost. This makes them an attractive option for companies, especially start-ups or smaller businesses with limited budgets. Because these open-source solutions offer capabilities similar to many commercial DBMS offerings, organizations may opt for them in order to save costs. Open-source solutions also benefit from active developer communities that contribute to their development, enhancement, and maintenance; this collaborative environment supports continuous innovation and improvement, resulting in solutions that are competitive with commercial offerings in terms of performance and features. While open-source solutions thus create competition for the commercial DBMS market, commercial vendors thrive by offering unique value propositions, addressing the needs of organizations that prioritize professional support, seamless integration into complex IT ecosystems, and advanced features.

Introduction of Database Management Systems
A Database Management System (DBMS) is software specifically designed to organize and manage data in a structured manner. It allows users to create, modify, and query a database, and to manage the security and access controls for that database. A DBMS offers tools for creating and modifying data models, which define the structure and relationships of the data in a database. It is also responsible for storing and retrieving data from the database, and provides several methods for searching and querying the data. A DBMS additionally offers mechanisms to control concurrent access to the database, ensuring that multiple users can access the data safely at the same time, and provides tools to enforce security constraints and data integrity, such as constraints on data values and access controls that restrict who can access the data. Finally, a DBMS provides mechanisms for backing up and recovering the data when a system failure occurs....
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset is an excerpt of our web scraping activities at Techmap.io and contains a sample of 33k Job Postings from the USA on May 5th 2023.
Techmap is a workplace search engine to help job-seekers find companies using specific technologies in their neighborhood. To identify the technologies used in companies we've collected and filtered job postings from all over the world and identified relevant technologies and workplace characteristics. In the process, we've charted technologies used in companies from different sources and built an extensive technology knowledge graph.
More job posting data exports starting from January 2020 can be bought from us as monthly, weekly, or daily exports.
We created this dataset by scraping multiple international sources and exporting all job ads from our MongoDB database using mongoexport. By default, mongoexport writes one JSON document per line, one for every MongoDB document.
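Because each line of the export is a standalone JSON document, the file can be streamed without loading it all into memory. A minimal sketch in Python (the file path and the aggregation are illustrative; orgAddress.city follows the schema table below):

```python
import json
from collections import Counter

def read_job_postings(path):
    """Yield job postings from a mongoexport file (one JSON document per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def postings_per_city(postings):
    """Count postings per city using the nested orgAddress.city field."""
    return Counter(
        p.get("orgAddress", {}).get("city", "unknown") for p in postings
    )
```

Note that mongoexport emits ObjectId and Date values in MongoDB Extended JSON (e.g. `{"$oid": ...}`), so fields like _id arrive as small objects rather than plain strings.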
This dataset was created to help data scientists and researchers across the world.
This work is licensed under CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International)
Total Records Count: 33064
Sources: 29 job boards (174 with country portals) such as CareerBuilder, EURES, Monster, or LinkedIn
Date Range: 5 May 2023 - 5 May 2023
File Extension: JSON
Schema (as generated by variety.js):
+----------------------------------------------------
| key | types | Explanation
| ------------------------| ----------| -------------
| _id | ObjectId | Unique ID from the MongoDB
| companyID | ObjectId | ID to a company document in our MongoDB (unique for company but not unique for jobs)
| contact | Object | Map/Object with contact info from the JSON, HTML or extracted from job posting
| contact.email | String | Corporate email address extracted from the JSON or job posting
| contact.phone | String | Corporate phone number extracted from the JSON or job posting
| dateCreated | Date | Date the job posting was created (or date scraped if creation date is not available)
| dateExpired | Date | Date the job posting expires
| dateScraped | Date | Date the job posting was scraped
| html | String | The raw HTML of the job description (can be plain text for some sources)
| idInSource | String | An id used in the source portal (unique for the source)
| json | Object | JSON found in the HTML page (schemaOrg contains a schema.org JobPosting; pageData1-3 contain source-specific JSON)
| locale | String | Locale extracted from the JSON or job posting (e.g., "en_US")
| locationID | ObjectId | ID to a location document in our MongoDB (unique for the location but not unique for jobs)
| name | String | Title or Name of the job posting
| orgAddress | Object | Original address data extracted from the job posting
| orgAddress.addressLine | String | Raw address line - mostly just a city name
| orgAddress.city | String | City name from JSON, HTML or extracted from addressLine
| orgAddress.companyName | String | Company name from JSON, HTML or extracted from addressLine
| orgAddress.country | String | Country name from JSON, HTML or extracted from addressLine
| orgAddress.countryCode | String | ISO 3166 (2 letter) country code from JSON, HTML or extracted from addressLine
| orgAddress.county | String | County name from JSON, HTML or extracted from addressLine
| orgAddress.district | String | (City) District name from JSON, HTML or extracted from addressLine
| orgAddress.formatted | String | Formatted address data extracted from the job posting
| orgAddress.geoPoint | Object | Map of geo coordinate if stated in the JSON or job posting
| orgAddress.geoPoint.lat | Number | Latitude of geo coordinate if stated in the JSON or job posting
| orgAddress.geoPoint.lng | Number | Longitude of geo coordinate if stated in the JSON or job posting
| orgAddress.houseNumber | String | House number extracted from the street field, or from JSON, HTML, or addressLine
| orgAddress.level | Number | Granularity of address (Street-level: 2, PostCode-Level: 3, City-Level: 4, ...)
| orgAddress.postCode | String | Postal code / zip code extracted from JSON, HTML or addressLine
| orgAddress.quarter | String | (City) Quarter name from JSON, HTML or extracted fro...
My MongoDB data has the following fields:
TimeStamp: This field stores the date and time when the data point was recorded.
data.line_speed: This field probably represents the current line speed of the machinery in the manufacturing process. Line speed is the speed at which the production line or conveyor belt is moving.
data.active_power: This field likely contains the active power consumption of the machinery. Active power refers to the actual power being used to perform useful work.
data.line_speed_nom: This is the nominal or expected line speed for the machinery. It’s useful for comparing the actual line speed to the expected speed.
data.machine_status: This field likely stores the current status of the machine, such as “running,” “idle,” “maintenance,” etc. It gives insight into whether the machine is operational.
data.progress_remaining_minute: This field possibly indicates the estimated time remaining for the current process step or task, measured in minutes.
data.final_die_diameter: This is the diameter of a die used in the manufacturing process. Dies are often used in shaping materials.
data.tk1_carrier1_active and data.tk1_carrier2_active: These fields represent the activity status of two carriers or conveyors in the manufacturing process. They could be binary indicators of whether these carriers are active or not.
data.tk1_progress_remaining_time: Similar to data.progress_remaining_minute, this field indicates the remaining time for a specific task associated with tk1 (which likely stands for “task 1”).
data.tk1_progress_length, data.tk1_progress_length_target: These fields pertain to the length of progress made in task 1 and the target length for task 1, respectively.
data.tk2_progress_remaining_time: Just like the tk1-related fields, this indicates the remaining time for a specific task associated with tk2.
data.tk2_progress_length, data.tk2_progress_length_target: Similarly, these fields relate to the length of progress made in task 2 and the target length for task 2.
A document is added to the database every 10 seconds.
Example of a document: 2023-07-13T13:14:31.467Z 12 238.8645325 12 0 9.18E-41 2.25 1 0 11.64297104 3417.060547 11800 0 0 28500
progress_length increases until it hits the length target, then resets to 0 and increases again; the progress length target remains the same until a new target is set.
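The reset behaviour described above makes it possible to count completed task cycles once the documents are fetched and sorted by TimeStamp. A minimal sketch in Python (field names follow the list above; the tolerance threshold is an assumption to tune against real data):

```python
def count_completed_cycles(docs, length_key="tk1_progress_length",
                           target_key="tk1_progress_length_target",
                           tolerance=0.05):
    """Count cycles where progress_length neared its target and then reset to 0.

    docs: iterable of {"data": {...}} documents sorted by TimeStamp.
    tolerance: fraction of the target within which the task counts as done
               (an assumption; adjust for real data).
    """
    completed = 0
    prev = None
    for doc in docs:
        data = doc.get("data", {})
        length = data.get(length_key)
        target = data.get(target_key)
        if length is None or not target:
            continue
        # A reset to 0 right after being near the target marks a finished cycle.
        if prev is not None and length == 0 and prev >= target * (1 - tolerance):
            completed += 1
        prev = length
    return completed
```

The same function works for task 2 by passing the tk2 field names as length_key and target_key.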
GNU Lesser General Public License v3.0: http://www.gnu.org/licenses/lgpl-3.0.html
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.
Size of the dataset: 41 GB uncompressed, 20 GB compressed
Key Features:
Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.
Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.
Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.
Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.
Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.
Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.
Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each record in the JSON contains an English word and its equivalent word in the target language. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique and the records are sorted.
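Assuming each JSON record pairs an English word with its equivalent in one target language (the "en", "lang", and "text" field names below are hypothetical, since the exact schema is not shown), the records can be grouped into per-language lookup tables:

```python
import json

def build_translation_map(records):
    """Group records into {language: {english_word: translation}} tables.

    Field names ("en", "lang", "text") are hypothetical placeholders for
    whatever keys the actual export uses.
    """
    table = {}
    for rec in records:
        table.setdefault(rec["lang"], {})[rec["en"]] = rec["text"]
    return table

# Tiny illustrative sample in the assumed record shape.
sample = json.loads(
    '[{"en": "water", "lang": "es", "text": "agua"},'
    ' {"en": "fire", "lang": "es", "text": "fuego"}]'
)
table = build_translation_map(sample)
```

Streaming the full 41 GB file would use the same grouping logic, applied line by line or in batches rather than on one parsed list.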
Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.
The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.
Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.
Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.
Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.
Dataset Preparation: The translation ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Jira is an issue tracking system that supports software companies (among other types of companies) with managing their projects, community, and processes. This dataset is a collection of public Jira repositories downloaded from the internet using the Jira API V2. We collected data from 16 public Jira repositories containing 1822 projects and 2.7 million issues. Included in this data are historical records of 32 million changes, 9 million comments, and 1 million issue links that connect the issues in complex ways. This artefact repository contains the data as a MongoDB dump, the scripts used to download the data, the scripts used to interpret the data, and qualitative work conducted to make the data more approachable.
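Since the artefact is distributed as a MongoDB dump, it can be loaded with the standard mongorestore tool; a minimal sketch, assuming the dump has been unpacked into a local ./dump directory and a MongoDB server is running on the default port:

```shell
# Restore the Jira dump into a running local MongoDB instance.
# The ./dump directory name is an assumption; point mongorestore at
# wherever the unpacked dump actually lives.
mongorestore --host localhost:27017 ./dump
```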
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide a manual classification of the application domain of 5,000 GitHub repositories (the most popular ones, by number of stars, in January 2017).
We classified each system into one of the following application domains:
Application software: systems that provide functionalities to end-users, like browsers and text editors (e.g., WordPress/WordPress and adobe/brackets).
System software: systems that provide services and infrastructure to other systems, like operating systems, middleware, and databases (e.g., torvalds/linux and mongodb/mongo).
Web libraries and frameworks (e.g., twbs/bootstrap and angular/angular.js).
Non-web libraries and frameworks (e.g., google/guava and facebook/fresco).
Software tools: systems that support development tasks, like IDEs, package managers, and compilers (e.g., Homebrew/homebrew and git/git).
Documentation: repositories with documentation, tutorials, source code examples, etc. (e.g., iluwatar/java-design-patterns).
To cite the dataset, please use the following paper (which proposes and uses a first dataset version):
Hudson Borges, Andre Hora, Marco Tulio Valente. Understanding the Factors that Impact the Popularity of GitHub Repositories. In 32nd IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 334-344, 2016.
Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Analyses for this dataset could include time series, clustering, classification and more.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains inventory data for a pharmacy e-commerce website in JSON format, designed for easy integration into MongoDB databases, making it ideal for MERN stack projects. It includes 10 fields:
This dataset is useful for developing pharmacy-related web applications, inventory management systems, or online medical stores using the MERN stack.
Do not use for production-level purposes; use for project development only. Feel free to contribute if you find any mistakes or have suggestions.