https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description: Clean and Ready for Relational Database Import
This dataset is a well-structured, thoroughly cleaned collection of data prepared for seamless import into a relational database. It has undergone data cleansing to remove inconsistencies, missing values, and duplicate records, so users can analyze it without additional preprocessing steps.
The ckanext-mysql2mongodb extension for CKAN appears to facilitate the migration of data from a MySQL database to a MongoDB database. While the provided README offers limited details, the extension seems to aim at simplifying the data transfer process within CKAN, likely by providing tools or scripts to extract data from MySQL and load it into MongoDB. This could be beneficial for users looking to leverage the strengths of MongoDB, such as its flexible schema and scalability, with their existing CKAN data.
Key Features:
MySQL to MongoDB Data Transfer: The primary function appears to be moving data from a MySQL database to a MongoDB instance, potentially helping users migrate their data. (Assumed based on the extension name.)
CKAN Plugin Integration: The extension integrates directly with CKAN and is made available through the CKAN plugin architecture, so administrators can use it without modifying the base CKAN code.
Technical Integration: The extension is enabled by adding mysql2mongodb to the ckan.plugins setting in the CKAN configuration file (/etc/ckan/default/ckan.ini by default). The README provides basic installation steps and notes that no configuration settings are mandatory at the moment.
Benefits & Impact: Based on the limited information available, the main potential benefit is enabling data transfer from MySQL to a MongoDB database within the CKAN ecosystem.
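For illustration, enabling the plugin amounts to a one-line change in the CKAN configuration file; the other plugin names shown below are placeholders and will differ per installation:
# /etc/ckan/default/ckan.ini (surrounding plugin list is illustrative)
ckan.plugins = stats text_view datastore mysql2mongodb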
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. They contain information about approximately 914 thousand defect reports filed over a period of 18 years (1999-2017) and capture the inter-relationships among duplicate defects.
File Descriptions
apache.csv - Apache Defect Rediscovery dataset
eclipse.csv - Eclipse Defect Rediscovery dataset
kde.csv - KDE Defect Rediscovery dataset
apache.relations.csv - Inter-relations of rediscovered defects of Apache
eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse
kde.relations.csv - Inter-relations of rediscovered defects of KDE
create_and_populate_neo4j_objects.cypher - Populates the Neo4j graph DB by importing all the data from the CSV files. Note that you have to set the dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files, as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping (see the example after this file list)
create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files
rediscovery_db_mysql.zip - For your convenience, we also provide a full backup of the MySQL database
neo4j_examples.txt - Sample Neo4j queries
mysql_examples.txt - Sample MySQL queries
rediscovery_eclipse_6325.png - Output of Neo4j example #1
distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
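As a minimal sketch (exact paths and credentials will differ per installation), the Neo4j import described above involves setting the quoting option in neo4j.conf and then running the provided Cypher script, for example with cypher-shell:
dbms.import.csv.legacy_quote_escaping=false
cypher-shell -u neo4j -p <password> -f create_and_populate_neo4j_objects.cypher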
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DACOS - DAtaset of COde Smells
The dataset offers annotated code snippets for three code smells: multifaceted abstraction, complex method, and long parameter list.
In addition to a manually annotated dataset of potentially subjective snippets, we offer a larger set of snippets that are either definitely benign or definitely smelly.
The upload contains three files:
Required Software
The dataset is created in MySQL. Hence, a local or remote MySQL installation with privileges to create and modify schemas is required.
Importing the Dataset
The datasets are distributed as self-contained SQL files. To import them, run the following commands:
mysql -u username -p database_name < DACOSMain.sql
mysql -u username -p database_name < DACOSExtended.sql
Understanding the Datasets
The two datasets differ in architecture. The main dataset contains a table named annotations that holds every annotation collected from users. The sample table contains the samples presented to users for annotation. The class_metrics and method_metrics tables contain class and method metrics, respectively. These were used to filter samples that are likely to contain smells and hence could be shown to users.
The extended dataset is created by selecting samples that fall below or above the selected metric range for each smell. Hence, these samples are definitely smelly or definitely benign. The extended version of the dataset does not contain an annotations table, since its samples were not presented to users. It instead has an 'entry' table in which each sample is classified according to the smell it contains. The codes identifying the smells are as follows:
Condition | Smell ID |
---|---|
Multifaceted Abstraction Present | 1 |
Multifaceted Abstraction not detected | 4 |
Long Parameter List Present | 2 |
Long Parameter List Absent | 5 |
Complex Method Present | 3 |
Complex Method Absent | 6 |
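As an illustrative sketch only (not part of the dataset), the mapping above can be used to pull samples for one smell from the extended database; the database name and the smell_id column below are hypothetical and should be replaced with the actual schema:
import mysql.connector  # pip install mysql-connector-python

# Mapping taken from the table above.
SMELL_CODES = {
    "multifaceted_abstraction_present": 1,
    "long_parameter_list_present": 2,
    "complex_method_present": 3,
    "multifaceted_abstraction_not_detected": 4,
    "long_parameter_list_absent": 5,
    "complex_method_absent": 6,
}

conn = mysql.connector.connect(user="username", password="...",
                               database="dacos_extended")  # hypothetical database name
cur = conn.cursor()
# The 'entry' table is described above; the column name smell_id is an assumption.
cur.execute("SELECT * FROM entry WHERE smell_id = %s",
            (SMELL_CODES["complex_method_present"],))
for row in cur.fetchall():
    print(row)
conn.close()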
Spatial Modeling for Resources Framework (SMRF) was developed at the USDA Agricultural Research Service (ARS) in Boise, ID, and was designed to increase the flexibility of taking measured weather data and distributing the point measurements across a watershed. SMRF was developed to be used as an operational or research framework, where ease of use, efficiency, and the ability to run in near real time are high priorities.
Highlights: Robust meteorological spatial forcing data development for physically based models; the Python framework can be used for research or operational applications; parallel processing and multi-threading allow for large modeling domains at high resolution; real-time and historical applications for water supply resources.
Features: SMRF was developed as a modular framework to enable new modules to be easily integrated and utilized. It can load data from a MySQL database, CSV files, or gridded climate models (i.e., WRF). Variables currently implemented: air temperature; vapor pressure; precipitation mass, phase, density, and percent snow; wind speed and direction; solar radiation; thermal radiation. Output variables are written to NetCDF files. A data queue supports multithreaded application, and computation tasks are implemented in C.
Resources in this dataset: Resource Title: SMRF GitHub repository. File Name: Web Page, url: https://github.com/USDA-ARS-NWRC/smrf
SMRF was designed to increase the flexibility of taking measured weather data, or atmospheric models, and distributing the data across a watershed.
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier which has 100% accuracy and the RF which also has 100% accuracy.
Methods
Prior to data collection, the researcher was guided by all ethical training certifications on data collection and the right to confidentiality and privacy, as required by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for a laboratory-confirmed diagnosis. The data were divided into two tables: the first, data1, contains data used in phase 1 of the classification, while the second, data2, contains data used in phase 2.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals for the period 2017 to 2021. These are the data used for modeling and analysis, bearing in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data shall be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2.
The dataset was split into two parts: a sample containing 70% of the data for training and the remaining 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with other machine learning models using the standard metrics.
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as the test data set.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part classified with classifier 2, as follows:
Classifier phase 1: It classifies patients into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: It classifies only the records that were classified as positive by classifier 1, and further classifies them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed so that the core parameters acting as determining factors supply their values.
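A minimal sketch of this two-phase setup using scikit-learn is shown below; the file name, the column names label and severity, and the one-hot encoding of the symptom attributes are assumptions for illustration, not the study's actual code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Hypothetical export of the 'malaria' MySQL database; column names are assumed.
df = pd.read_csv("malaria_records.csv")
# One-hot encode the categorical attributes (MNB expects non-negative counts).
X = pd.get_dummies(df.drop(columns=["label", "severity"]))
y = df["label"]  # phase 1 target: positive (P) / negative (N)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Phase 1: positive vs negative.
phase1 = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, phase1.predict(X_test)))
# Phase 2: complicated vs uncomplicated, trained only on positive cases.
pos = df["label"] == "P"
X2, y2 = X[pos], df.loc[pos, "severity"]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)
phase2 = MultinomialNB().fit(X2_train, y2_train)
print(classification_report(y2_test, phase2.predict(X2_test)))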
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.
Our dataset is located at the path dataset/MaRV.json
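A minimal sketch for loading the dataset file in Python (the internal JSON structure is not described here, so only the top-level load is shown):
import json

with open("dataset/MaRV.json", encoding="utf-8") as f:
    marv = json.load(f)
# The 693 manually evaluated code pairs are described above; inspect the top-level structure first.
print(type(marv))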
The guidelines for replicating the study are provided below:
Dependencies are listed in requirements.txt. Create a .env file based on .env.example in the src folder and set the variables:
CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
CLONE_DIR: Directory where repositories will be cloned.
JAVA_PATH: Path to the Java executable.
REFACTORING_MINER_PATH: Path to RefactoringMiner.
Install the dependencies with pip install -r requirements.txt. The file referenced by CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
Configure the .env file and set up the repositories CSV, then run:
python3 src/run_rm.py
The script clones each repository into CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it. The RefactoringMiner results are written as .json files in CLONE_DIR, and logs as .log files in the same directory.
To count the detected refactorings, run:
python3 src/count_refactorings.py
Its output, refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
To collect snippets before and after refactoring and their metadata, run:
python3 src/diff.py '[refactoring technique]'
Replace [refactoring technique] with the desired technique name (e.g., Extract Method). The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.
Dataset Availability: The MaRV dataset is provided in the dataset directory.
To generate the SQL file for the Web tool (located in the web directory), run:
python3 src/generate_refactorings_sql.py
Populate the data/output/snippets folder with the output of src/diff.py, run the sql/create_database.sql script in your database, then run src/generate_refactorings_sql.py, and finally use dataset.php to generate the MaRV dataset file, which is placed in the dataset directory of the replication package.
CONTEXT
Practice Scenario: The UIW School of Engineering wants to recruit more students into its program. It will recruit students with strong math scores. Also, to increase the chances of recruitment, the department will look for students who qualify for financial aid. Students who qualify for financial aid more than likely come from low socio-economic backgrounds. One way to indicate this is to view how much federal revenue a school district receives through its state. High federal revenue for a school indicates that a large portion of the student base comes from low-income families.
The question we wish to ask is as follows: name the school districts across the nation whose Child Nutrition Programs (c25) are federally funded between $30,000 and $50,000, and where the average math score for the school district's corresponding state is greater than or equal to the nation's average score of 282.
The SQL query below, in 'Top5MathTarget.sql', can be used to answer this question in MySQL. To execute this process, one would need to install MySQL on their local system and load the attached datasets below from Kaggle into their MySQL schema. The SQL query will then join the separate tables on various key identifiers.
DATA SOURCE Data is sourced from The U.S Census Bureau and The Nations Report Card (using the NAEP Data Explorer).
Finance: https://www.census.gov/programs-surveys/school-finances/data/tables.html
Math Scores: https://www.nationsreportcard.gov/ndecore/xplore/NDE
COLUMN NOTES
All data comes from the school year 2017. Individual schools are not represented, only school districts within each state.
FEDERAL FINANCE DATA DEFINITIONS
t_fed_rev: Total federal revenue through the state to each school district.
C14 - Federal revenue through the state - Title 1 (No Child Left Behind Act).
C25 - Federal revenue through the state - Child Nutrition Act.
Title 1 is a program implemented in schools to help raise academic achievement for all students. The program is available to schools where at least 40% of the students come from low-income families.
Child Nutrition Programs ensure the children are getting the food they need to grow and learn. Schools with high federal revenue to these programs indicate students that also come from low income families.
MATH SCORES DATA DEFINITIONS
Note: Mathematics, Grade 8, 2017, All Students (Total)
average_scale_score - The state's average score for eighth graders taking the NAEP math exam.
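For illustration only (this is not the contents of Top5MathTarget.sql), a join of the following shape could answer the question; table and column names other than c25 and average_scale_score (e.g., finance, math_scores, school_district, state, and the schema name) are hypothetical and must be adapted to the actual Kaggle tables:
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user="username", password="...", database="schools")  # hypothetical schema
cur = conn.cursor()
# finance: census school-finance table (c25 = Child Nutrition Program federal revenue)
# math_scores: NAEP state averages (average_scale_score)
cur.execute("""
    SELECT f.school_district, f.state, f.c25, m.average_scale_score
    FROM finance AS f
    JOIN math_scores AS m ON m.state = f.state
    WHERE f.c25 BETWEEN 30000 AND 50000
      AND m.average_scale_score >= 282
""")
for district, state, c25, score in cur.fetchall():
    print(district, state, c25, score)
conn.close()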
https://academictorrents.com/nolicensespecified
Parsed Russian Wiktionary SQL dump (MySQL): ruwikt20230901_parsed.sql.7z (92 Mb). Russian Wiktionary source database: ruwikt20230901.sql.7z (416 Mb). 13 log files with errors generated by the parser during parsing of the Russian Wiktionary. The Russian Wiktionary was parsed by the Wikokit software; the Wiktionary parser source code (Java) and documentation are available on GitHub.
Unpack the dump: 7z e ruwikt20230901_parsed.sql.7z
Import the unpacked SQL file into the MySQL database:
mysql$ CREATE DATABASE ruwikt20230901_parsed;
mysql$ USE ruwikt20230901_parsed
mysql$ SOURCE /path/ruwikt20230901_parsed.sql
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata for the LibGen scimag database of full-text scholarly documents. Each row of this dataset corresponds to a scholarly document in the LibGen scimag database, as identified by its DOI.
scimag_dbbackup-2017-04-07.rar was downloaded from http://libgen.io/dbdumps/backup_archive/scimag_dbbackup-2017-04-07.rar. It's a compressed SQL dump of the LibGen scimag metadata database on 2017-04-07. This is the unmodified file downloaded from libgen.io. It encodes a single table named scimag.
libgen-scimag-2017-04-07.tsv.xz contains a TSV version of the scimag table from scimag_dbbackup-2017-04-07.rar. It's more user-friendly because it provides access to the data without requiring MySQL, is UTF-8 encoded, and has null bytes removed.
The code that downloaded and processed these datasets is at https://git.io/v7Uh4. Users should note that the TimeAdded column appears to store the modification rather than the creation date for each DOI. As discussed in https://doi.org/b9s5, this field should not be mistaken for the date of first upload to LibGen scimag.
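A minimal sketch for loading the TSV export with pandas (apart from DOI and TimeAdded, nothing is assumed about the column set):
import pandas as pd

# pandas decompresses .xz transparently; the file is tab-separated and UTF-8 encoded.
scimag = pd.read_csv("libgen-scimag-2017-04-07.tsv.xz", sep="\t", compression="xz", encoding="utf-8")
print(scimag.columns.tolist())
print(scimag["TimeAdded"].head())  # modification date, not the date of first upload (see note above)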
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information about 327,436 potential BPMN artifacts identified in all public GitHub repositories referenced in the GHTorrent dump from March 2021.
The data file is in line-delimited JSON format, with each row containing an array with the following six elements:
To get a list of retrievable URLs, use e.g. the following Python one-liner:
python3 -c 'import json; import sys; print(*[f"https://raw.githubusercontent.com/{u}/{r}/{b}/{f}" for _, u, r, b, f, _ in map(json.loads, sys.stdin)], sep="\n")' < bpmn-artifacts.jsonl > urls.txt
(using the hashes to filter out duplicates first is recommended, though)
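A sketch of such deduplication in Python, assuming the hash is the first of the six elements (adjust the index if the actual layout differs):
import json

seen = set()
with open("bpmn-artifacts.jsonl") as src, open("urls.txt", "w") as dst:
    for line in src:
        record = json.loads(line)
        # Assumed layout: [hash, user, repo, branch, path, ...]
        digest, user, repo, branch, path = record[0], record[1], record[2], record[3], record[4]
        if digest in seen:  # skip files already seen under another location
            continue
        seen.add(digest)
        dst.write(f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}\n")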
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: For some time, the VIVO for Weill Cornell Medical College (WCMC) had struggled with both unacceptable page load times and unreliable uptime. With some individual profiles containing upwards of 800 publications, WCMC VIVO has relatively large profiles, but no profile was so large that it could account for this performance. The WCMC VIVO Implementation Team explored a number of options for improving performance including caching, better hardware, query optimization, limiting user access to large pages, using another instance of Tomcat, throttling bots, and blocking IPs issuing too many requests. But none of these avenues were fruitful.
Analysis of triple stores: With the 1.7 version, VIVO ships with the Jena SDB triple store, but the SDB version of Jena is no longer supported by its developers. In April, we reviewed various published analyses and benchmarks suggesting there were alternatives to Jena, such as Virtuoso, that perform better than even Jena's successor, TDB. In particular, the Berlin SPARQL Benchmark v. 3.1 [1] showed that Virtuoso had the strongest performance compared to the other data stores measured, including BigData, BigOwlim, and Jena TDB. In addition, Virtuoso is used on dbpedia.org, which serves up 3 billion triples compared to only 12 million in WCMC's VIVO site. Whereas Jena SDB stores its triples in a MySQL database, Virtuoso manages its triples in a binary file. The software is available in open source and commercial editions.
Configuration: In late 2014, we installed Virtuoso on a local machine and loaded data from our production VIVO. Some queries completed in about 10% of the time as compared to our production VIVO. However, we noticed that the listview queries invoked whenever profile pages were loaded were still slow. After soliciting feedback from members of both the Virtuoso and VIVO communities, we modified these queries to rely on the OPTIONAL instead of the UNION construct. This modification, which wasn't possible in a Jena SDB environment, reduced by eight-fold the number of queries that the application makes of the triple store. About four or five additional steps were required for VIVO and Virtuoso to work optimally with one another; these are documented in the VIVO Duraspace wiki.
Results: On March 31, WCMC launched Virtuoso in its production environment. According to our instance of New Relic, VIVO has an average page load of about four seconds and 99% uptime, both of which are dramatic improvements. There are opportunities for further tuning: the four-second average includes pages such as the visualizations as well as pages served up to logged-in users, which are slower than other types of pages.
[1] http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/#comparison
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[!NOTE] Dataset origin: https://wortschatz.uni-leipzig.de/en/download/Breton
Description
The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists as well as for applications such as knowledge extraction programs. The corpora are identical in… See the full description on the dataset page: https://huggingface.co/datasets/Bretagne/leipzig_corpora_br.
https://www.gesis.org/en/institute/data-usage-terms
The prepared longitudinal IntermediaPlus dataset 2014 to 2016 is "big data", which is why the full dataset will only be available in the form of a database (MySQL). In this database, the information on a respondent's different variables is stored one below the other. The present publication comprises an SQL database with the metadata of a sample of the full dataset, which represents a subset of the available variables of the full dataset and is intended to show the structure of the prepared data, together with a data documentation (codebook) of the sample. For this purpose, the sample contains all variables on sociodemography, leisure activities, additional information on a respondent and their household, as well as the interview-specific variables and weights. Only the variables concerning the respondent's media use are a small selection: for online media use, the variables of all overall offerings as well as the individual offerings of the genres politics and digital were included. The media use of radio, print and TV was not included in the sample because its structure can be traced using the published longitudinal data of the media analyses MA Radio, MA Pressemedien and MA Intermedia.
Due to the size of the data material, the database with the actual survey data would already be in the critical file-size range for a normal upload and download. The actual survey results needed for analysis will therefore be published in 2021 as the full dataset of the Media-Analyse data: IntermediaPlus (2014-2016) in the GESIS Data Archive (DBK).
The data and their preparation are proposed as a best-practice case for big-data management and for handling big data in the social sciences and with social science data. The documentation and transparency of the harmonization work are provided using the GESIS software CharmStats, which was extended with big-data features as part of this project. A Python script and an HTML template were also used to further automate the workflow around and with CharmStats.
The prepared longitudinal version of the full MA IntermediaPlus dataset for 2014 to 2016 will be published in 2021 in cooperation with GESIS and made available in accordance with the FAIR principles (Wilkinson et al. 2016). By harmonizing the individual cross-sections, carried out by Inga Brentel and Céline Fabienne Kampes as part of the dissertation project "Angebots- und Publikumsfragmentierung online", the aim is to make the Media-Analyse data source accessible for research on social and media change in the Federal Republic of Germany.
Future study number of the full IntermediaPlus dataset in the GESIS Data Archive (DBK): ZA5769 (version 1-0-0), DOI: https://dx.doi.org/10.4232/1.13530