https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description: Clean and Ready for Relational Database Import
This dataset is a well-structured, thoroughly cleaned collection of data prepared for seamless import into a relational database. It has undergone data cleansing to remove inconsistencies, missing values, and duplicate records, so users can analyze it without additional preprocessing steps.
The ckanext-mysql2mongodb extension for CKAN appears to facilitate the migration of data from a MySQL database to a MongoDB database. While the provided README offers limited details, the extension seems to aim at simplifying the data transfer process within CKAN, likely by providing tools or scripts to extract data from MySQL and load it into MongoDB. This could be beneficial for users looking to leverage the strengths of MongoDB, such as its flexible schema and scalability, with their existing CKAN data.
Key Features:
MySQL to MongoDB Data Transfer: The primary function appears to be moving data from a MySQL database to a MongoDB instance, potentially helping users migrate their data. (Assumed based on the extension name.)
CKAN Plugin Integration: The extension integrates directly with CKAN and is made available through the CKAN plugin architecture, so administrators can use it without modifying the base CKAN code.
Technical Integration: The extension is enabled by adding mysql2mongodb to the ckan.plugins setting in the CKAN configuration file (/etc/ckan/default/ckan.ini by default). The README provides basic installation steps and notes that no configuration settings are mandatory at the moment.
Benefits & Impact: Based on the limited information available, the main potential benefit is enabling data transfer from MySQL to a MongoDB database within the CKAN ecosystem.
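For illustration, enabling the plugin amounts to a one-line change in the CKAN configuration file; the other plugin names shown below are placeholders and will differ per installation:
# /etc/ckan/default/ckan.ini (surrounding plugin list is illustrative)
ckan.plugins = stats text_view datastore mysql2mongodb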
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present three defect rediscovery datasets mined from Bugzilla. The datasets capture data for three groups of open source software projects: Apache, Eclipse, and KDE. They contain information about approximately 914 thousand defect reports filed over a period of 18 years (1999-2017) and capture the inter-relationships among duplicate defects.
File Descriptions
apache.csv - Apache Defect Rediscovery dataset
eclipse.csv - Eclipse Defect Rediscovery dataset
kde.csv - KDE Defect Rediscovery dataset
apache.relations.csv - Inter-relations of rediscovered defects of Apache
eclipse.relations.csv - Inter-relations of rediscovered defects of Eclipse
kde.relations.csv - Inter-relations of rediscovered defects of KDE
create_and_populate_neo4j_objects.cypher - Populates the Neo4j graph DB by importing all the data from the CSV files. Note that you have to set the dbms.import.csv.legacy_quote_escaping configuration setting to false to load the CSV files, as per https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_dbms.import.csv.legacy_quote_escaping (see the example after this file list)
create_and_populate_mysql_objects.sql - Populates MySQL RDBMS by importing all the data from the CSV files
rediscovery_db_mysql.zip - For your convenience, we also provide a full backup of the MySQL database
neo4j_examples.txt - Sample Neo4j queries
mysql_examples.txt - Sample MySQL queries
rediscovery_eclipse_6325.png - Output of Neo4j example #1
distinct_attrs.csv - Distinct values of bug_status, resolution, priority, severity for each project
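As a minimal sketch (exact paths and credentials will differ per installation), the Neo4j import described above involves setting the quoting option in neo4j.conf and then running the provided Cypher script, for example with cypher-shell:
dbms.import.csv.legacy_quote_escaping=false
cypher-shell -u neo4j -p <password> -f create_and_populate_neo4j_objects.cypher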
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DACOS - DAtaset of COde Smells
The dataset offers annotated code snippets for three code smells: multifaceted abstraction, complex method, and long parameter list.
In addition to a manually annotated dataset of potentially subjective snippets, we offer a larger set of snippets that are either definitely benign or definitely smelly.
The upload contains three files:
Required Software
The dataset is created in MySQL. Hence, a local or remote MySQL installation with privileges to create and modify schemas is required.
Importing the Dataset
The datasets are distributed as self-contained SQL files. To import them, run the following commands:
mysql -u username -p database_name < DACOSMain.sql
mysql -u username -p database_name < DACOSExtended.sql
Understanding the Datasets
The two datasets differ in architecture. The main dataset contains a table named annotations that holds every annotation collected from users. The sample table contains the samples presented to users for annotation. The class_metrics and method_metrics tables contain class and method metrics, respectively. These were used to filter samples that are likely to contain smells and hence could be shown to users.
The extended dataset is created by selecting samples that fall below or above the selected metric range for each smell. Hence, these samples are definitely smelly or definitely benign. The extended version of the dataset does not contain an annotations table, since its samples were not presented to users. It instead has an 'entry' table in which each sample is classified according to the smell it contains. The codes identifying the smells are as follows:
Condition | Smell ID |
---|---|
Multifaceted Abstraction Present | 1 |
Multifaceted Abstraction not detected | 4 |
Long Parameter List Present | 2 |
Long Parameter List Absent | 5 |
Complex Method Present | 3 |
Complex Method Absent | 6 |
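As an illustrative sketch only (not part of the dataset), the mapping above can be used to pull samples for one smell from the extended database; the database name and the smell_id column below are hypothetical and should be replaced with the actual schema:
import mysql.connector  # pip install mysql-connector-python

# Mapping taken from the table above.
SMELL_CODES = {
    "multifaceted_abstraction_present": 1,
    "long_parameter_list_present": 2,
    "complex_method_present": 3,
    "multifaceted_abstraction_not_detected": 4,
    "long_parameter_list_absent": 5,
    "complex_method_absent": 6,
}

conn = mysql.connector.connect(user="username", password="...",
                               database="dacos_extended")  # hypothetical database name
cur = conn.cursor()
# The 'entry' table is described above; the column name smell_id is an assumption.
cur.execute("SELECT * FROM entry WHERE smell_id = %s",
            (SMELL_CODES["complex_method_present"],))
for row in cur.fetchall():
    print(row)
conn.close()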
Spatial Modeling for Resources Framework (SMRF) was developed at the USDA Agricultural Research Service (ARS) in Boise, ID, and was designed to increase the flexibility of taking measured weather data and distributing the point measurements across a watershed. SMRF was developed to be used as an operational or research framework, where ease of use, efficiency, and the ability to run in near real time are high priorities.
Highlights: Robust meteorological spatial forcing data development for physically based models; the Python framework can be used for research or operational applications; parallel processing and multi-threading allow for large modeling domains at high resolution; real-time and historical applications for water supply resources.
Features: SMRF was developed as a modular framework to enable new modules to be easily integrated and utilized. It can load data from a MySQL database, CSV files, or gridded climate models (i.e., WRF). Variables currently implemented: air temperature; vapor pressure; precipitation mass, phase, density, and percent snow; wind speed and direction; solar radiation; thermal radiation. Output variables are written to NetCDF files. A data queue supports multithreaded application, and computation tasks are implemented in C.
Resources in this dataset: Resource Title: SMRF GitHub repository. File Name: Web Page, url: https://github.com/USDA-ARS-NWRC/smrf
SMRF was designed to increase the flexibility of taking measured weather data, or atmospheric models, and distributing the data across a watershed.
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation between attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier which has 100% accuracy and the RF which also has 100% accuracy.
Methods
Prior to data collection, the researcher was guided by all ethical training certifications on data collection and the right to confidentiality and privacy, as required by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed into electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for a laboratory-confirmed diagnosis. The data were divided into two tables: the first, data1, contains data used in phase 1 of the classification, while the second, data2, contains data used in phase 2.
Data Source Collection
The malaria incidence dataset was obtained from public hospitals for the period 2017 to 2021. These are the data used for modeling and analysis, bearing in mind the geographical location and socio-economic factors available for patients inhabiting those areas. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data shall be divided into two portions: one portion shall be extracted as a training set, while the other portion will be used for testing. The training portion taken from one table stored in the database shall be called training set 1, while the training portion taken from another table stored in the database shall be called training set 2.
The dataset was split into two parts: a sample containing 70% of the data for training and the remaining 30% for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with other machine learning models using the standard metrics.
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is illustrated as follows:
i. Data collection and preprocessing shall be done.
ii. Preprocessed data shall be stored in training set 1 and training set 2. These datasets shall be used during classification.
iii. The test data set shall be stored in the database as the test data set.
iv. Part of the test data set shall be classified using classifier 1 and the remaining part classified with classifier 2, as follows:
Classifier phase 1: It classifies patients into positive or negative classes. If the patient has malaria, the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.
Classifier phase 2: It classifies only the records that were classified as positive by classifier 1, and further classifies them into complicated and uncomplicated class labels. The classifier will also capture data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system will be designed so that the core parameters acting as determining factors supply their values.
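A minimal sketch of this two-phase setup using scikit-learn is shown below; the file name, the column names label and severity, and the one-hot encoding of the symptom attributes are assumptions for illustration, not the study's actual code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Hypothetical export of the 'malaria' MySQL database; column names are assumed.
df = pd.read_csv("malaria_records.csv")
# One-hot encode the categorical attributes (MNB expects non-negative counts).
X = pd.get_dummies(df.drop(columns=["label", "severity"]))
y = df["label"]  # phase 1 target: positive (P) / negative (N)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Phase 1: positive vs negative.
phase1 = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, phase1.predict(X_test)))
# Phase 2: complicated vs uncomplicated, trained only on positive cases.
pos = df["label"] == "P"
X2, y2 = X[pos], df.loc[pos, "severity"]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.3, random_state=42)
phase2 = MultinomialNB().fit(X2_train, y2_train)
print(classification_report(y2_test, phase2.predict(X2_test)))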
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.
Our dataset is located at the path dataset/MaRV.json
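A minimal sketch for loading the dataset file in Python (the internal JSON structure is not described here, so only the top-level load is shown):
import json

with open("dataset/MaRV.json", encoding="utf-8") as f:
    marv = json.load(f)
# The 693 manually evaluated code pairs are described above; inspect the top-level structure first.
print(type(marv))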
The guidelines for replicating the study are provided below:
Dependencies are listed in requirements.txt. Create a .env file based on .env.example in the src folder and set the variables:
CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
CLONE_DIR: Directory where repositories will be cloned.
JAVA_PATH: Path to the Java executable.
REFACTORING_MINER_PATH: Path to RefactoringMiner.
Install the dependencies with pip install -r requirements.txt. The file referenced by CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
Configure the .env file and set up the repositories CSV, then run:
python3 src/run_rm.py
The script clones each repository into CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it. The RefactoringMiner results are written as .json files in CLONE_DIR, and logs as .log files in the same directory.
To count the detected refactorings, run:
python3 src/count_refactorings.py
Its output, refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
To collect snippets before and after refactoring and their metadata, run:
python3 src/diff.py '[refactoring technique]'
Replace [refactoring technique] with the desired technique name (e.g., Extract Method). The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.
Dataset Availability: The MaRV dataset is provided in the dataset directory.
To generate the SQL file for the Web tool (located in the web directory), run:
python3 src/generate_refactorings_sql.py
Populate the data/output/snippets folder with the output of src/diff.py, run the sql/create_database.sql script in your database, then run src/generate_refactorings_sql.py, and finally use dataset.php to generate the MaRV dataset file, which is placed in the dataset directory of the replication package.
CONTEXT
Practice Scenario: The UIW School of Engineering wants to recruit more students into its program. It will recruit students with strong math scores. Also, to increase the chances of recruitment, the department will look for students who qualify for financial aid. Students who qualify for financial aid more than likely come from low socio-economic backgrounds. One way to indicate this is to view how much federal revenue a school district receives through its state. High federal revenue for a school indicates that a large portion of the student base comes from low-income families.
The question we wish to ask is as follows: name the school districts across the nation whose Child Nutrition Programs (c25) are federally funded between $30,000 and $50,000, and where the average math score for the school district's corresponding state is greater than or equal to the nation's average score of 282.
The SQL query below, in 'Top5MathTarget.sql', can be used to answer this question in MySQL. To execute this process, one would need to install MySQL on their local system and load the attached datasets below from Kaggle into their MySQL schema. The SQL query will then join the separate tables on various key identifiers.
DATA SOURCE Data is sourced from The U.S Census Bureau and The Nations Report Card (using the NAEP Data Explorer).
Finance: https://www.census.gov/programs-surveys/school-finances/data/tables.html
Math Scores: https://www.nationsreportcard.gov/ndecore/xplore/NDE
COLUMN NOTES
All data comes from the school year 2017. Individual schools are not represented, only school districts within each state.
FEDERAL FINANCE DATA DEFINITIONS
t_fed_rev: Total federal revenue through the state to each school district.
C14 - Federal revenue through the state - Title 1 (No Child Left Behind Act).
C25 - Federal revenue through the state - Child Nutrition Act.
Title 1 is a program implemented in schools to help raise academic achievement for all students. The program is available to schools where at least 40% of the students come from low-income families.
Child Nutrition Programs ensure the children are getting the food they need to grow and learn. Schools with high federal revenue to these programs indicate students that also come from low income families.
MATH SCORES DATA DEFINITIONS
Note: Mathematics, Grade 8, 2017, All Students (Total)
average_scale_score - The state's average score for eighth graders taking the NAEP math exam.
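For illustration only (this is not the contents of Top5MathTarget.sql), a join of the following shape could answer the question; table and column names other than c25 and average_scale_score (e.g., finance, math_scores, school_district, state, and the schema name) are hypothetical and must be adapted to the actual Kaggle tables:
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user="username", password="...", database="schools")  # hypothetical schema
cur = conn.cursor()
# finance: census school-finance table (c25 = Child Nutrition Program federal revenue)
# math_scores: NAEP state averages (average_scale_score)
cur.execute("""
    SELECT f.school_district, f.state, f.c25, m.average_scale_score
    FROM finance AS f
    JOIN math_scores AS m ON m.state = f.state
    WHERE f.c25 BETWEEN 30000 AND 50000
      AND m.average_scale_score >= 282
""")
for district, state, c25, score in cur.fetchall():
    print(district, state, c25, score)
conn.close()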
https://academictorrents.com/nolicensespecified
Parsed Russian Wiktionary SQL dump (MySQL): ruwikt20230901_parsed.sql.7z (92 Mb). Russian Wiktionary source database: ruwikt20230901.sql.7z (416 Mb). 13 log files with errors generated by the parser during parsing of the Russian Wiktionary. The Russian Wiktionary was parsed by the Wikokit software; the Wiktionary parser source code (Java) and documentation are available on GitHub.
Unpack the dump: 7z e ruwikt20230901_parsed.sql.7z
Import the unpacked SQL file into the MySQL database:
mysql$ CREATE DATABASE ruwikt20230901_parsed;
mysql$ USE ruwikt20230901_parsed
mysql$ SOURCE /path/ruwikt20230901_parsed.sql
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata for the LibGen scimag database of full-text scholarly documents. Each row of this dataset corresponds to a scholarly document in the LibGen scimag database, as identified by its DOI.
scimag_dbbackup-2017-04-07.rar was downloaded from http://libgen.io/dbdumps/backup_archive/scimag_dbbackup-2017-04-07.rar. It's a compressed SQL dump of the LibGen scimag metadata database on 2017-04-07. This is the unmodified file downloaded from libgen.io. It encodes a single table named scimag.
libgen-scimag-2017-04-07.tsv.xz contains a TSV version of the scimag table from scimag_dbbackup-2017-04-07.rar. It's more user-friendly because it provides access to the data without requiring MySQL, is UTF-8 encoded, and has null bytes removed.
The code that downloaded and processed these datasets is at https://git.io/v7Uh4. Users should note that the TimeAdded column appears to store the modification rather than the creation date for each DOI. As discussed in https://doi.org/b9s5, this field should not be mistaken for the date of first upload to LibGen scimag.
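A minimal sketch for loading the TSV export with pandas (apart from DOI and TimeAdded, nothing is assumed about the column set):
import pandas as pd

# pandas decompresses .xz transparently; the file is tab-separated and UTF-8 encoded.
scimag = pd.read_csv("libgen-scimag-2017-04-07.tsv.xz", sep="\t", compression="xz", encoding="utf-8")
print(scimag.columns.tolist())
print(scimag["TimeAdded"].head())  # modification date, not the date of first upload (see note above)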
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information about 327,436 potential BPMN artifacts identified in all public GitHub repositories referenced in the GHTorrent dump from March 2021.
The data file is in line-delimited JSON format, with each row containing an array with the following six elements:
To get a list of retrievable URLs, use e.g. the following Python one-liner:
python3 -c 'import json; import sys; print(*[f"https://raw.githubusercontent.com/{u}/{r}/{b}/{f}" for _, u, r, b, f, _ in map(json.loads, sys.stdin)], sep="\n")' < bpmn-artifacts.jsonl > urls.txt
(using the hashes to filter out duplicates first is recommended, though)
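A sketch of such deduplication in Python, assuming the hash is the first of the six elements (adjust the index if the actual layout differs):
import json

seen = set()
with open("bpmn-artifacts.jsonl") as src, open("urls.txt", "w") as dst:
    for line in src:
        record = json.loads(line)
        # Assumed layout: [hash, user, repo, branch, path, ...]
        digest, user, repo, branch, path = record[0], record[1], record[2], record[3], record[4]
        if digest in seen:  # skip files already seen under another location
            continue
        seen.add(digest)
        dst.write(f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}\n")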
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: For some time, the VIVO for Weill Cornell Medical College (WCMC) had struggled with both unacceptable page load times and unreliable uptime. With some individual profiles containing upwards of 800 publications, WCMC VIVO has relatively large profiles, but no profile was so large that it could account for this performance. The WCMC VIVO Implementation Team explored a number of options for improving performance including caching, better hardware, query optimization, limiting user access to large pages, using another instance of Tomcat, throttling bots, and blocking IPs issuing too many requests. But none of these avenues were fruitful.
Analysis of triple stores: With the 1.7 version, VIVO ships with the Jena SDB triple store, but the SDB version of Jena is no longer supported by its developers. In April, we reviewed various published analyses and benchmarks suggesting there were alternatives to Jena, such as Virtuoso, that perform better than even Jena's successor, TDB. In particular, the Berlin SPARQL Benchmark v. 3.1 [1] showed that Virtuoso had the strongest performance compared to the other data stores measured, including BigData, BigOwlim, and Jena TDB. In addition, Virtuoso is used on dbpedia.org, which serves up 3 billion triples compared to only 12 million in WCMC's VIVO site. Whereas Jena SDB stores its triples in a MySQL database, Virtuoso manages its triples in a binary file. The software is available in open source and commercial editions.
Configuration: In late 2014, we installed Virtuoso on a local machine and loaded data from our production VIVO. Some queries completed in about 10% of the time as compared to our production VIVO. However, we noticed that the listview queries invoked whenever profile pages were loaded were still slow. After soliciting feedback from members of both the Virtuoso and VIVO communities, we modified these queries to rely on the OPTIONAL instead of the UNION construct. This modification, which wasn't possible in a Jena SDB environment, reduced by eight-fold the number of queries that the application makes of the triple store. About four or five additional steps were required for VIVO and Virtuoso to work optimally with one another; these are documented in the VIVO Duraspace wiki.
Results: On March 31, WCMC launched Virtuoso in its production environment. According to our instance of New Relic, VIVO has an average page load of about four seconds and 99% uptime, both of which are dramatic improvements. There are opportunities for further tuning: the four-second average includes pages such as the visualizations as well as pages served up to logged-in users, which are slower than other types of pages.
[1] http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/#comparison
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[!NOTE] Dataset origin: https://wortschatz.uni-leipzig.de/en/download/Breton
Description
The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists as well as for applications such as knowledge extraction programs. The corpora are identical in… See the full description on the dataset page: https://huggingface.co/datasets/Bretagne/leipzig_corpora_br.
https://www.gesis.org/en/institute/data-usage-terms
The prepared longitudinal IntermediaPlus dataset 2014 to 2016 is "big data", which is why the full dataset will only be available in the form of a database (MySQL). In this database, the information on a respondent's different variables is stored one below the other. The present publication comprises an SQL database with the metadata of a sample of the full dataset, which represents a subset of the available variables of the full dataset and is intended to show the structure of the prepared data, together with a data documentation (codebook) of the sample. For this purpose, the sample contains all variables on sociodemography, leisure activities, additional information on a respondent and their household, as well as the interview-specific variables and weights. Only the variables concerning the respondent's media use are a small selection: for online media use, the variables of all overall offerings as well as the individual offerings of the genres politics and digital were included. The media use of radio, print and TV was not included in the sample because its structure can be traced using the published longitudinal data of the media analyses MA Radio, MA Pressemedien and MA Intermedia.
Due to the size of the data material, the database with the actual survey data would already be in the critical file-size range for a normal upload and download. The actual survey results needed for analysis will therefore be published in 2021 as the full dataset of the Media-Analyse data: IntermediaPlus (2014-2016) in the GESIS Data Archive (DBK).
The data and their preparation are proposed as a best-practice case for big-data management and for handling big data in the social sciences and with social science data. The documentation and transparency of the harmonization work are provided using the GESIS software CharmStats, which was extended with big-data features as part of this project. A Python script and an HTML template were also used to further automate the workflow around and with CharmStats.
The prepared longitudinal version of the full MA IntermediaPlus dataset for 2014 to 2016 will be published in 2021 in cooperation with GESIS and made available in accordance with the FAIR principles (Wilkinson et al. 2016). By harmonizing the individual cross-sections, carried out by Inga Brentel and Céline Fabienne Kampes as part of the dissertation project "Angebots- und Publikumsfragmentierung online", the aim is to make the Media-Analyse data source accessible for research on social and media change in the Federal Republic of Germany.
Future study number of the full IntermediaPlus dataset in the GESIS Data Archive (DBK): ZA5769 (version 1-0-0), DOI: https://dx.doi.org/10.4232/1.13530