https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn't known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and a list of references used to determine the location of the main buildings or sites.
This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files. Depending on your computation resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program "wikiq" to parse MediaWiki XML dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003.

These instructions work backwards: from building the manuscript using knitr, to loading the datasets and running the analysis, to building the intermediate datasets.

Building the manuscript using knitr: This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive (on a unix system, run tar xf code.tar) and navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should then be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise, try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets: The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using readRDS, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

Running the analysis: Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models.
See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives; on a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

Generating datasets

Building the intermediate files: The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z; on a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z. Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

Building all.edits.RDS: The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
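If you prefer to explore the intermediate datasets outside R, the .tab files mentioned above can be read with pandas; a minimal sketch, assuming the files are tab-separated and that a file such as newcomers.tab exists alongside newcomers.RDS:

```python
import pandas as pd

# Read one of the tab-separated intermediate files extracted from
# intermediate_data.7z (adjust the file name to what you actually extracted).
newcomers = pd.read_csv("newcomers.tab", sep="\t")

# Quick sanity checks on the loaded table.
print(newcomers.shape)
print(newcomers.head())
```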
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contacts:
website: https://labsoft-ufmg.github.io/
email: henrique.mg.bh@gmail.com
The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.
Our dataset is located at the path dataset/MaRV.json
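A minimal Python sketch for loading the dataset file; the internal structure of the JSON entries is an assumption to verify against the actual file:

```python
import json

# Load the MaRV dataset from the path given above.
with open("dataset/MaRV.json", encoding="utf-8") as f:
    marv = json.load(f)

# Inspect the top-level structure before relying on any particular schema;
# the dataset is described as 693 manually evaluated code pairs.
print(type(marv))
print(len(marv))
```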
The guidelines for replicating the study are provided below:
- Install the dependencies listed in requirements.txt: pip install -r requirements.txt
- Create a .env file based on .env.example in the src folder and set the variables:
  - CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
  - CLONE_DIR: Directory where repositories will be cloned.
  - JAVA_PATH: Path to the Java executable.
  - REFACTORING_MINER_PATH: Path to RefactoringMiner.
- The CSV file at CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).
- After creating the .env file and setting up the repositories CSV, run: python3 src/run_rm.py
- The script clones each repository into CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it. RefactoringMiner results are stored as .json files in CLONE_DIR, and logs are stored as .log files in the same directory.
- To count the detected refactorings, run: python3 src/count_refactorings.py. The output, refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
- To collect snippets before and after refactoring and their metadata, run: python3 src/diff.py '[refactoring technique]'. Replace [refactoring technique] with the desired technique name (e.g., Extract Method).
- The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

Dataset Availability:
- The dataset is available in the dataset directory.
- To generate the SQL file for the Web tool, run: python3 src/generate_refactorings_sql.py
- The Web tool is available in the web directory.
- Populate the data/output/snippets folder with the output of src/diff.py.
- Run the sql/create_database.sql script in your database.
- Run src/generate_refactorings_sql.py.
- Use dataset.php to generate the MaRV dataset file, which is available in the dataset directory of the replication package.

Signed into law in October 2020, the Chesapeake Watershed Investments for Landscape Defense (Chesapeake WILD) Act directed the Secretary of the Interior, through the U.S. Fish and Wildlife Service, to develop and implement the non-regulatory Chesapeake WILD Program. Chesapeake WILD was established for the following purposes: 1. Coordination among federal, state, local, and regional entities to establish a shared vision for sustaining natural resources and human communities throughout the Chesapeake Bay and its watershed, 2. Engagement of diverse agencies and organizations to build capacity and generate funding that address shared restoration and conservation priorities, and 3. Collaboration to administer a grant program and implement projects to conserve, steward, and enhance fish and wildlife habitats and related conservation values.

This dataset represents the 25 projects awarded more than $7.4 million, which generated $12 million in match from the grantees, providing a total conservation impact of $19.4 million for conservation investments that support unmet place-based needs aligned with five shared pillars. Spatial data are collected from grantees in a polygon geometry. The accuracy of the polygons that are submitted depends on the methodology used by grantees to create them. Some grantees have a GIS team that uploads detailed project footprints developed by experienced GIS analysts. Others create project footprints during the grant application process using a map-based online drawing tool. All polygons are reviewed and approved by relevant program staff to ensure proper scale.

Points are generated from the polygons submitted by grantees. If there is one polygon, the point is placed at the centroid of the polygon. If there are multiple polygons in close proximity to each other, one point is generated as a representative point of the multiple polygons. If there are multiple polygons that are far away from each other, multiple points are created (the centroid of each of the polygons). Consequently, polygons are recommended for quantitative analyses requiring high spatial resolution, and points are recommended for mapping the general location of many projects over large areas (this means there are some duplicate points on the map). Please take the accuracy, methodology, and guidelines into consideration as you conduct your analyses. These data are intended for your internal analysis and the use case outlined above only. Spatial data should not be shared with other third-party entities without approval from the National Fish and Wildlife Foundation. If you have any questions about the data, please feel free to contact Mike Lagua (Michael.Lagua@nfwf.org).
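A minimal sketch of the point-derivation rules described above, assuming the shapely library and a hypothetical proximity threshold; this is illustrative only, not the Foundation's actual tooling:

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def project_points(polygons, proximity_threshold=1000.0):
    """Derive representative points from a grantee's project polygons.

    One polygon -> its centroid. Several polygons that all lie within the
    (assumed) proximity threshold of each other -> one centroid of their
    union. Otherwise -> one centroid per polygon.
    """
    if len(polygons) == 1:
        return [polygons[0].centroid]
    if all(a.distance(b) <= proximity_threshold
           for i, a in enumerate(polygons) for b in polygons[i + 1:]):
        return [unary_union(polygons).centroid]
    return [p.centroid for p in polygons]

# Example with two small, nearby squares: yields a single representative point.
squares = [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
           Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])]
print(project_points(squares))
```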
https://creativecommons.org/publicdomain/zero/1.0/
By math_dataset (From Huggingface) [source]
This dataset comprises a collection of mathematical problems and their solutions designed for training and testing purposes. Each problem is presented in the form of a question, followed by its corresponding answer. The dataset covers various mathematical topics such as arithmetic, polynomials, and prime numbers. For instance, the arithmetic_nearest_integer_root_test.csv file focuses on problems involving finding the nearest integer root of a given number. Similarly, the polynomials_simplify_power_test.csv file deals with problems related to simplifying polynomials with powers. Additionally, the dataset includes the numbers_is_prime_train.csv file containing math problems that require determining whether a specific number is prime or not. The questions and answers are provided in text format to facilitate analysis and experimentation with mathematical problem-solving algorithms or models.
Introduction: The Mathematical Problems Dataset contains a collection of various mathematical problems and their corresponding solutions or answers. This guide will provide you with all the necessary information on how to utilize this dataset effectively.
Understanding the columns: The dataset consists of several columns, each representing a different aspect of the mathematical problem and its solution. The key columns are:
- question: This column contains the text representation of the mathematical problem or equation.
- answer: This column contains the text representation of the solution or answer to the corresponding problem.
Exploring specific problem categories: To focus on specific types of mathematical problems, you can filter or search within the dataset using relevant keywords or terms related to your area of interest. For example, if you are interested in prime numbers, you can search for prime in the question column.
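For example, a minimal pandas sketch that filters the prime-number training split for questions mentioning "prime" (file and column names as described above):

```python
import pandas as pd

# Load the prime-number training split and keep rows whose question mentions "prime".
df = pd.read_csv("numbers_is_prime_train.csv")
prime_rows = df[df["question"].str.contains("prime", case=False, na=False)]

print(prime_rows[["question", "answer"]].head())
```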
Applying machine learning techniques: This dataset can be used for training machine learning models related to natural language understanding and mathematics. You can explore various techniques such as text classification, sentiment analysis, or even sequence-to-sequence models for solving mathematical problems based on their textual representations.
Generating new questions and solutions: By analyzing patterns in this dataset, you can generate new questions and solutions programmatically using techniques like data augmentation or rule-based methods.
Validation and evaluation: As with any other machine learning task, it is essential to properly validate your models on separate validation sets not included in this dataset. You can also evaluate model performance by comparing predictions against the known answers provided in this dataset's answer column.
Sharing insights and findings: After working with this dataset, it would be beneficial for researchers or educators to share their insights and the approaches taken during analysis and modelling as Kaggle notebooks, discussions, blogs, tutorials, etc., so that others can benefit from these shared resources too.
Note: Please note that the dataset does not include dates.
By following these guidelines, you can effectively explore and utilize the Mathematical Problems Dataset for various mathematical problem-solving tasks. Happy exploring!
- Developing machine learning algorithms for solving mathematical problems: This dataset can be used to train and test models that can accurately predict the solution or answer to different mathematical problems.
- Creating educational resources: The dataset can be used to create a wide variety of educational materials such as problem sets, worksheets, and quizzes for students studying mathematics.
- Research in mathematical problem-solving strategies: Researchers and educators can analyze the dataset to identify common patterns or strategies employed in solving different types of mathematical problems. This analysis can help improve teaching methodologies and develop effective problem-solving techniques
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purpos...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
(KinMap_Examples.zip) contains the input CSV files used to generate the annotated kinome trees in Fig. 1 (Example_1_Erlotinib_NSCLC.csv), Fig. 2a (Example_2_Sunitinib_Sorafenib_Cancer.csv), and Fig. 2b (Example_3_Kinase_Stats.csv). (ZIP 5 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data is sourced from CSIRO Parkes ATNF, e.g. http://www.atnf.csiro.au/research/pulsar/psrcat/

Feel the pulse of the universe: we're taking signal data from astronomical "pulsar" sources and creating a way to listen to their signals audibly. Pulsar data is available from ATNF at CSIRO.au. Our team at #SciHackMelb has been working on a #datavis to give researchers and others a novel way to explore the Pulsar corpus, especially through the sound of the frequencies at which the Pulsars emit pulses. Link to the project page at #SciHackMelb: http://www.the-hackfest.com/events/melbourne-science-hackfest/projects/pulsar-voices/

The files attached here include: source data, project presentation, data as used in the website (final_pulsar.sql), and other methodology documentation. Importantly, see the GitHub link, which contains the data manipulation code, the HTML code to present the data and render it audibly, and an iPython Notebook to process single pulsar data into an audible waveform file. Together all these resources are the Pulsar Voices activity and resulting data.

Source Data:
* RA - east/west coordinates (0 - 24 hrs, roughly equates to longitude) [Theta transforms RA to 0 - 360 degrees]
* Dec - north/south coordinates (-90, +90; roughly equates to latitude, i.e. 90 is above the north pole and -90 the south pole)
* P0 - the time in seconds that a pulsar repeats its signal
* f - 1/P0, which ranges from 700 cycles per second down to pulses which occur every few seconds
* kps - distance from Earth in kilo-parsecs. 1 kps = 3,000 light years. The furthest data is 30 kps. The galactic centre is about 25,000 light years away, i.e. about 8 kps.

Files:
* psrcatShort.csv - 2,295 pulsars (all known pulsars) with the above fields: RA, Dec, Theta
* psrcatMedium.csv - adds P0 and kps, only 1,428 lines (i.e. not available for all 2,295 datapoints)
* psrcatSparse.csv - adds P0 and kps, blanks if n/a, 2,295 lines
* short.txt - important pulsars with high levels of observation (** even more closely examined)
* pulsar.R - code contributed by Ben Raymond to visualise pulsar frequency and period in a histogram
* pulsarVoices_authors.JPG - photo of the authors from SciHackMelb

Added to the raw data:
- Coordinates to map RA, Dec to screen width(y)/height(x): y = RA[Theta]*width/360; x = (Dec + 90)*height/180
- Audible frequency converted from pulsar frequency (1/P0). Formula for 1/P0(x) -> Hz(y): y = 10 ^ (0.5 log(x) + 2.8). Explanation in the text file Convert1/P0toHz.txt. Tone generator from: http://www.softsynth.com/webaudio/tone.php
- Detailed audible waveform file converted from pulsar signal data, and waveform image (the python notebook to generate these is available).

The project source is hosted on GitHub at https://github.com/gazzar/pulsarvoices. An IPython/Jupyter notebook contains code and a rough description of the method used to process a psrfits .sf file downloaded via the CSIRO Data Access Portal at http://doi.org/10.4225/08/55940087706E1. The notebook contains experimental code to read one of these .sf files and access the contained spectrogram data, processing it to generate an audible signal. It also reads the .txt files containing columnar pulse phase data (which is also contained in the .sf files) and processes these by frequency modulating the signal with an audible carrier. This is the method used to generate the .wav and .png files used in the web interface: https://github.com/gazzar/pulsarvoices/blob/master/ipynb/hackfest1.ipynb. A standalone python script that does the .txt to .png and .wav signal processing was used to process 15 more pulsar data examples.
These can be reproduced by running the script: https://github.com/gazzar/pulsarvoices/blob/master/data/pulsarvoices.py. Processed files at: https://github.com/gazzar/pulsarvoices/tree/master/web (e.g. https://github.com/gazzar/pulsarvoices/blob/master/web/J0437-4715.png; J0437-4715.wav | J0437-4715.png).

#Datavis online at: http://checkonline.com.au/tooltip.php. Code at the GitHub link above; see especially https://github.com/gazzar/pulsarvoices/blob/master/web/index.php, particularly lines 314 - 328 (or search: "SELECT * FROM final_pulsar";), which load pulsar data from the DB and push it to the screen with Hz on mouseover.

Pulsar Voices webpage functions:
1. There is sound when you run the mouse across the pulsars. We plot all known pulsars (N=2,295), and play a tone for pulsars we had frequency data on, i.e. about 75%.
2. In the bottom left corner, a more detailed pulsar sound and wave image pops up when you click the star icon. Two of the team worked exclusively on turning a single pulsar's waveform into an audible wav file. They created 16 of these files, and a workflow, but the team only had time to load one waveform. With more time, it would be great to load these files.
3. If you leave the mouse over a pulsar, a little data description pops up, with location (RA, Dec), distance (kilo-parsecs; 1 = 3,000 light years), and frequency of rotation (and Hz converted to human hearing).
4. If you click on a pulsar, other pulsars with similar frequency are highlighted in white. With more time I was interested to see if there are harmonics between pulsars, i.e. related frequencies.

The Team:
Michael Walker: orcid.org/0000-0003-3086-6094; Biosciences PhD student, Unimelb, Melbourne.
Richard Ferrers: orcid.org/0000-0002-2923-9889; ANDS Research Data Analyst, Innovation/Value Researcher, Melbourne.
Sarath Tomy: http://orcid.org/0000-0003-4301-0690; La Trobe PhD Comp Sci, Melbourne.
Gary Ruben: http://orcid.org/0000-0002-6591-1820; CSIRO Postdoc at Australian Synchrotron, Melbourne.
Christopher Russell: Data Manager, CSIRO, Sydney. https://wiki.csiro.au/display/ASC/Chris+Russell
Anderson Murray: orcid.org/0000-0001-6986-9140; Physics Honours, Monash, Melbourne.
Contact richard.ferrers@ands.org.au for more information.

What is still left to do?
* Load data, description, images fileset to figshare :: DOI; DONE except DOI
* Add overview images as an option, e.g. frequency bi-modal histogram
* Colour code pulsars by distance; DONE
* Add pulsar detail sound to top three Observants; 16 pulsars processed but not loaded
* Add tones to pulsars to indicate f; DONE
* Add tooltips to show location, distance, frequency, name; DONE
* Add title and description; DONE
* Project data onto a planetarium dome with interaction to play pulsar frequencies; DONE, see youtube video at https://youtu.be/F119gqOKJ1U
* Zoom into parts of the sky to get separation between close data points - see youtube; function in Google Earth #datavis of dataset. Link at youtube.
* Set upper and lower tone boundaries, so tones aren't annoying
* Colour code pulsars by frequency bins e.g. >100 Hz, 10 - 100, 1 - 10,
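The two conversions listed above (sky coordinates to screen position, and pulse frequency to an audible tone) can be written as a short Python sketch; the base-10 logarithm is assumed for the Hz formula:

```python
import math

def pulsar_to_screen(theta_deg, dec_deg, width, height):
    """Map sky coordinates to screen position using the formulas above.

    theta_deg is RA already converted from hours to 0-360 degrees (Theta);
    Dec runs from -90 to +90 degrees.
    """
    y = theta_deg * width / 360.0
    x = (dec_deg + 90.0) * height / 180.0
    return x, y

def pulse_to_audible_hz(p0_seconds):
    """Convert a pulsar period P0 (seconds) to an audible frequency:
    Hz = 10 ** (0.5 * log10(1 / P0) + 2.8)."""
    f = 1.0 / p0_seconds
    return 10 ** (0.5 * math.log10(f) + 2.8)

# A fast millisecond pulsar (P0 ~ 0.006 s) maps to a tone of roughly 8 kHz.
print(pulse_to_audible_hz(0.006))
print(pulsar_to_screen(theta_deg=180.0, dec_deg=-45.0, width=1200, height=800))
```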
Notifications are important for the user experience in mobile apps and can influence their engagement. However, too many notifications can be disruptive for users. In this work, we study a novel centralized approach for notification optimization, where we view the opportunities to send user notifications as items and types of notifications as buyers in an auction market.
The full dataset, instagram_notification_auction_base_dataset.csv, contains all generated notifications for a subset of Instagram users across four notification types within a certain time window. Each entry of the dataset represents one generated notification. For each generated notification, we include some information related to the notification as well as information related to the auctions performed to determine if the generated notification can be sent to users. See the README file for detailed column descriptions. The dataset was collected during an A/B test where we compare the performance of the first-price auction system with that of the second-price auction system.
The two derived datasets can be useful to study fair online allocation and Fisher market equilibrium. See the README for details and a link to the scripts that generate the derived datasets.
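For readers unfamiliar with the two mechanisms compared in the A/B test, here is a generic sketch of winner selection and payment under first-price and second-price rules (illustrative only; it is not the production notification system):

```python
def run_auction(bids, mechanism="second_price"):
    """Select the winning notification type and its payment for one opportunity.

    bids maps a notification type to its bid. Under first-price rules the
    winner pays its own bid; under second-price rules it pays the runner-up's bid.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top_bid = ranked[0]
    if mechanism == "first_price":
        payment = top_bid
    else:
        payment = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, payment

# Hypothetical bids from four notification types competing for one send slot.
bids = {"comment": 0.9, "like": 0.7, "follow": 0.4, "mention": 0.2}
print(run_auction(bids, "first_price"))   # ('comment', 0.9)
print(run_auction(bids, "second_price"))  # ('comment', 0.7)
```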
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CSV files contain the raw data, which are used to generate Figures in the paper. LN_chiral_T.py is the python code for RCWA solver. All the transmission spectra in the paper are calculated based on LN_chiral_T.py. LN_chiral_modeanalysis.mph is a model of the COMSOL Multiphysics. The resonant frequencies and damping rates, electric field distributions, and optical chiral densities are simulated by using LN_chiral_modeanalysis.mph.In Figure 1, Fig1b.csv file is the raw data of Figure 1(b), calculated with LN_chiral_T.py. Fig1c_sim.csv and Fig1d_sim.csv are the raw data of the 'triangle maker' curves in Figure 1(c) and 1(d), which are generated from the LN_chiral_modeanalysis.mph. Fig1c_anal.csv and Fig1d_anal.csv are the raw data of the solid curves in Figure 1(c) and 1(d), which calculated using the Eq. (3) of the main text. Figure 1(e) is generated from the COMSOL model LN_chiral_modeanalysis.mph. In Figure 2, Fig2a.csv is the raw data of Figure 2(a), which are calculated using the Eq. (3) of the main text. Fig2b.csv is the raw data of Figure 2(b), which is generated from LN_chiral_modeanalysis.mph. Figure 2(c) is generated from the COMSOL model LN_chiral_modeanalysis.mph.In Figure 3, Fig3a.csv and Fig3b.csv are raw data of Figure 3(a) and Figure 3(b), which are calculated with RCWA code LN_chiral_T.py. Fig3c.csv and Fig3d.csv are raw data of Figure 3(c) and Figure 3(d), which are generated based on Figure 3(a) and Figure 3(b). Figure 3(e) is produced by LN_chiral_modeanalysis.mph. Fig3f.csv is the raw data of Figure 3(f), which is generated from LN_chiral_modeanalysis.mph.In Figure 4, Fig4b.csv and Fig4d.csv are raw experimental data of Figure 4(b) and Figure 4(d). Fig4c_blue.csv and Fig4c_red.csv are raw data of Figure 4(c), which is calculated based on Figure 4(b) and Figure 4(d).To run LN_chiral_modeanalysis.mph, COMSOL Multiphysics software with version 5.6 or higher should be installed. To run LN_chiral_T.py, rcwa solver S4 (https://web.stanford.edu/group/fan/S4/) should be installed.
https://creativecommons.org/publicdomain/zero/1.0/
Airline data holds immense importance as it offers insights into the functioning and efficiency of the aviation industry. It provides valuable information about flight routes, schedules, passenger demographics, and preferences, which airlines can leverage to optimize their operations and enhance customer experiences. By analyzing data on delays, cancellations, and on-time performance, airlines can identify trends and implement strategies to improve punctuality and mitigate disruptions. Moreover, regulatory bodies and policymakers rely on this data to ensure safety standards, enforce regulations, and make informed decisions regarding aviation policies. Researchers and analysts use airline data to study market trends, assess environmental impacts, and develop strategies for sustainable growth within the industry. In essence, airline data serves as a foundation for informed decision-making, operational efficiency, and the overall advancement of the aviation sector.
This dataset comprises diverse parameters relating to airline operations on a global scale. The dataset prominently incorporates fields such as Passenger ID, First Name, Last Name, Gender, Age, Nationality, Airport Name, Airport Country Code, Country Name, Airport Continent, Continents, Departure Date, Arrival Airport, Pilot Name, and Flight Status. These columns collectively provide comprehensive insights into passenger demographics, travel details, flight routes, crew information, and flight statuses. Researchers and industry experts can leverage this dataset to analyze trends in passenger behavior, optimize travel experiences, evaluate pilot performance, and enhance overall flight operations.
The dataset provided here is a simulated example and was generated using the online platform found at Mockaroo. This web-based tool offers a service that enables the creation of customizable Synthetic datasets that closely resemble real data. It is primarily intended for use by developers, testers, and data experts who require sample data for a range of uses, including testing databases, filling applications with demonstration data, and crafting lifelike illustrations for presentations and tutorials. To explore further details, you can visit their website.
Cover Photo by: Kevin Woblick on Unsplash
Thumbnail by: Airplane icons created by Freepik - Flaticon
What this is: A privacy-safe synthetic retail transactions dataset with realistic baskets and prices, zero PII, and a production-style schema so you can unblock AOV/CLV/demand/promo POCs immediately.
What you get - Line-item transactions with quantity, unit_price, discount, location, timestamp - Derived metric: total_value per line/basket - Formats: CSV and Parquet - Docs: data dictionary + daily manifest; optional locked schema - Delivery: secure S3 link or private share (Snowflake/Databricks/BigQuery) on request
Why synthetic? - Zero PII and GDPR-friendly experimentation - Start modeling today while prod access is pending - Stable distributions for benchmarking algorithms and pipelines
Known limits - Not a drop-in replacement for your production POS/e-com telemetry - Aggregate patterns approximate real-world behaviour; individual rows are synthetic
Sample (2,000 rows)
We include a 2,000-row CSV that mirrors the production schema (0% nulls across sample columns; unique transaction_id; ISO 8601 timestamps). Add your own summary stats (AOV, discount rate, category mix) if desired.
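As a starting point, a minimal pandas sketch computing AOV from the sample, assuming the line-item columns listed above and a placeholder file name:

```python
import pandas as pd

# Load the 2,000-row sample (file name is a placeholder; use the delivered CSV).
lines = pd.read_csv("retail_transactions_sample.csv", parse_dates=["timestamp"])

# Average order value: sum line-level total_value per basket, then average.
basket_totals = lines.groupby("transaction_id")["total_value"].sum()
print(f"AOV: {basket_totals.mean():.2f} across {basket_totals.size} baskets")
```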
Support & customizations: Need region/category mixes, holiday uplift, or promo elasticity scenarios? We can generate tailored cohorts and refresh cadences.
To generate a representative dataset of real-world traffic in ISCX, we defined a set of tasks, ensuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.).
We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated:
Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing any task that includes the use of a browser. For instance, when we captured voice-calls using hangouts, even though browsing is not the main activity, we captured several browsing flows.
Email: Traffic samples generated using a Thunderbird client and Alice's and Bob's Gmail accounts. The clients were configured to deliver mail through SMTP/S, and to receive it using POP3/SSL in one client and IMAP/SSL in the other.
Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browsers, Skype, and AIM and ICQ using an application called pidgin [14].
Streaming: The streaming label identifies multimedia applications that require a continuous and steady stream of data. We captured traffic from Youtube (HTML5 and flash versions) and Vimeo services using Chrome and Firefox.
File Transfer: This label identifies traffic applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.
VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.
P2P: This label is used to identify file-sharing protocols like Bittorrent. To generate this traffic we downloaded different .torrent files from a public repository and captured traffic sessions using the uTorrent and Transmission applications.
The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a client.
To facilitate the labeling process, all unnecessary services and applications were closed when capturing the traffic. (The only application executed was the objective of the capture, e.g., Skype voice-call, SFTP file transfer, etc.) We used a filter to capture only the packets with source or destination IP matching the address of the local client (Alice or Bob).
The full research paper outlining the details of the dataset and its underlying principles:
Gerard Drapper Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy.
ISCXFlowMeter has been written in Java for reading the pcap files and creating the csv file based on selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network traffic, including full packets in pcap format and csv files (flows generated by ISCXFlowMeter), both of which are publicly available for researchers.
For more information contact cic@unb.ca.
The UNB ISCX Network Traffic Dataset content
Traffic: Content
Web Browsing: Firefox and Chrome
Email: SMTPS, POP3S and IMAPS
Chat: ICQ, AIM, Skype, Facebook and Hangouts
Streaming: Vimeo and Youtube
File Transfer: Skype, FTPS and SFTP using Filezilla and an external service
VoIP: Facebook, Skype and Hangouts voice calls (1h duration)
P2P: uTorrent and Transmission (Bittorrent)
https://creativecommons.org/publicdomain/zero/1.0/
Stock market data can be interesting to analyze and as a further incentive, strong predictive models can have large financial payoff. The amount of financial data on the web is seemingly endless. A large and well structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.
The script I used to acquire all of these .csv files can be found in this GitHub repository. In the future, if you wish for a more up-to-date dataset, it can be used to acquire new versions of the .csv files.
The data is presented in a couple of formats to suit different individual's needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder) and a smaller version of the dataset (all_stocks_1yr.csv) with only the past year's stock data for those wishing to use something more manageable in size.
The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv and all_stocks_1yr.csv contain this same data, presented in merged .csv files. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.
All the files have the following columns:
Date - in format yy-mm-dd
Open - price of the stock at market open (this is NYSE data, so all in USD)
High - highest price reached in the day
Low - lowest price reached in the day
Close - price of the stock at market close
Volume - number of shares traded
Name - the stock's ticker name
I scraped this data from Google finance using the python library 'pandas_datareader'. Special thanks to Kaggle, Github and The Market.
This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock stats such as volatility and moving averages can be easily calculated. The million dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?
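As a starting point for such metrics, a minimal pandas sketch computing a 20-day moving average and rolling volatility for one ticker, assuming the column names listed above:

```python
import pandas as pd

# Load the merged 5-year file; columns per the description above.
stocks = pd.read_csv("all_stocks_5yr.csv", parse_dates=["Date"])

# Pick one ticker (AAPL is just an example) and compute a 20-day moving
# average of the close and the rolling standard deviation of daily returns.
aapl = stocks[stocks["Name"] == "AAPL"].sort_values("Date").copy()
aapl["ma_20"] = aapl["Close"].rolling(20).mean()
aapl["volatility_20"] = aapl["Close"].pct_change().rolling(20).std()

print(aapl[["Date", "Close", "ma_20", "volatility_20"]].tail())
```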
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains ratings and reviews for 1K+ Amazon products, as per their details listed on the official website of Amazon. The data was scraped in January 2023 from the official website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
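Because the file is ordered (all negative rows first, then all positive rows), a minimal pandas sketch of the recommended shuffle:

```python
import pandas as pd

# Load the reviews (columns: reviews, labels) and shuffle the rows, since the
# first 5,331 rows are negative and the last 5,331 are positive.
df = pd.read_csv("data_rt.csv")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(df["labels"].head(10).tolist())
```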
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
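A sketch of how polarity-based labels of this kind can be produced with TextBlob; the thresholds used to map the polarity score to a categorical label are an assumption, not necessarily those used to build this dataset:

```python
from textblob import TextBlob

def label_review(text, neutral_band=0.05):
    """Return (polarity, label) using TextBlob's polarity score in [-1, 1].

    The neutral band around zero is an assumed cut-off.
    """
    polarity = TextBlob(text).sentiment.polarity
    if polarity > neutral_band:
        return polarity, "positive"
    if polarity < -neutral_band:
        return polarity, "negative"
    return polarity, "neutral"

print(label_review("The speaker sounds great and setup was easy."))
```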
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current DNA sequencing technologies and their high-throughput yield have allowed genomic and transcriptomic experiments to thrive, but they have also generated a big data problem. Due to this exponential growth of sequencing data, the complexity of managing, processing and interpreting it in order to generate results has also risen. Therefore, the demand for easy-to-use, friendly software and websites to run bioinformatic tools is imminent. In particular, RNA-Seq and differential expression analysis have become a popular and useful method to evaluate genetic expression change in any organism. However, many scientists struggle with the data analysis since most of the available tools are implemented in a UNIX-based environment. Therefore, we have developed the web server IDEAMEX (Integrative Differential Expression Analysis for Multiple EXperiments). The IDEAMEX pipeline needs a raw count table for as many desired replicates and conditions, allowing the user to select which conditions will be compared, instead of doing all-vs.-all comparisons. The whole process consists of three main steps: (1) Data Analysis, which allows a preliminary analysis for quality control based on the data distribution per sample, using different types of graphs; (2) Differential Expression, which performs the differential expression analysis with or without batch effect error awareness, using the Bioconductor packages NOISeq, limma-Voom, DESeq2 and edgeR, and generates reports for each method; (3) Result Integration, where the obtained results are integrated and reported using different graphical outputs such as correlograms, heatmaps, Venn diagrams and text lists. Our server allows easy and friendly visualization of results, providing easy interaction during the analysis process, as well as error tracking and debugging by providing output log files. The server is currently available and can be accessed at http://www.uusmb.unam.mx/ideamex/ where the documentation and example input files are provided. We consider that this web server can help other researchers with no previous bioinformatic knowledge to perform their analyses in a simple manner.
The data consist of two parts: Time trade-off (TTO) data with one row per TTO question (5 questions), and discrete choice experiment (DCE) data with one row per question (6 questions). The purpose of the data is the calculation of a Swedish value set for the capability-adjusted life years (CALY-SWE) instrument. To protect the privacy of the study participants and to comply with GDPR, access to the data is given upon request.
The data is provided in 4 .csv files: tto.csv, dce.csv, weight_final_model.csv, and coefs_final_model.csv.
The first two files (tto.csv, dce.csv) contain the time trade-off (TTO) answers and discrete choice experiment (DCE) answers of participants. The latter two files (weight_final_model.csv, coefs_final_model.csv) contain the generated value set of CALY-SWE weights, and the pertaining coefficients of the main effects additive model.
Background:
CALY-SWE is a capability-based instrument for studying Quality of Life (QoL). It consists of 6 attributes (health, social relations, financial situation & housing, occupation, security, political & civil rights) and provides the option to answer each attribute on 3 levels (Agree, Agree partially, Do not agree). A configuration or state is one of the 3^6 = 729 possible situations that the instrument describes. Here, a config is denoted in the form xxxxxx, one x for each attribute in the order above. X is a digit corresponding to the level of the respective attribute, with 3 being the highest (Agree) and 1 being the lowest (Do not agree). For example, 222222 encodes a configuration with all attributes on level 2 (Agree partially). The purpose of this dataset is to support the publication of the CALY-SWE value set and to enable reproduction of the calculations (due to privacy concerns we abstain from publishing individual-level characteristics). A value set consists of values on the 0 to 1 scale for all 729 configurations, each of which represents a quality weighting, where 1 is the highest capability-related QoL and 0 the lowest capability-related QoL.
The data contains answers to two types of questions: TTO and DCE.
In TTO questions, participants iteratively chose a number of years between 1 and 10, such that living that many years with full capability (state configuration 333333) is equivalent to living 10 years in the capability state that the TTO question describes. The answer on the 0 to 1 scale is then calculated as x/10. In the DCE questions, participants were given two states and chose the state that they found to be better. We used a hybrid model with a linear regression and a logit model component, where the coefficients were linked through a multiplicative factor, to obtain the weights (weight_final_model.csv). Each weight is calculated as the constant plus the coefficients for the respective configuration. Coefficients for level 3 encode the difference to level 2, and coefficients for level 2 the difference to the constant. For example, the weight for 123112 is calculated as constant + socrel2 + finhou2 + finhou3 + polciv2 (no coefficients for health, occupation, and security are involved, as they are on level 1, which is captured in the constant/intercept).
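A small Python sketch of the weight calculation described above, using the coefficient naming scheme from coefs_final_model.csv (attribute abbreviation plus level); the coefficient values below are placeholders, not the published estimates:

```python
# Attribute order matches the 6-digit configuration string:
# health, socrel, finhou, occu, secu, polciv.
ATTRS = ["health", "socrel", "finhou", "occu", "secu", "polciv"]

def weight_for(config, coefs, constant):
    """Sum the constant plus the level-2 coefficient (and, for level 3, also
    the level-3 coefficient) implied by a configuration such as '123112'."""
    total = constant
    for attr, level in zip(ATTRS, config):
        if level in ("2", "3"):
            total += coefs[f"{attr}2"]  # difference from the constant
        if level == "3":
            total += coefs[f"{attr}3"]  # additional difference from level 2
    return total

coefs = {f"{a}{lvl}": 0.05 for a in ATTRS for lvl in (2, 3)}  # placeholder values
# 123112 -> constant + socrel2 + finhou2 + finhou3 + polciv2
print(weight_for("123112", coefs, constant=0.3))
```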
To assess the quality of TTO answers, we calculated a score per participant that takes into account inconsistencies in answering the TTO question. We then excluded 20% of participants with the worst score to improve the TTO data quality and signal strength for the model (this is indicated by the 'included' variable in the TTO dataset). Details of the entire survey are described in the preprint “CALY-SWE value set: An integrated approach for a valuation study based on an online-administered TTO and DCE survey” by Meili et al. (2023). Please check this document for updated versions.
Ids have been randomized with preserved linkage between the DCE and TTO dataset.
Data files and variables:
Below is a description of the variables in each CSV file.
- tto.csv:
  - config: 6 numbers representing the attribute levels.
  - position: The number of the asked TTO question.
  - tto_block: The design block of the TTO question.
  - answer: The equivalence value indicated by the participant, ranging from 0.1 to 1 in steps of 0.1.
  - included: Whether the answer was included in the data for the model to generate the value set.
  - id: Randomized id of the participant.
- dce.csv:
  - config1: Configuration of the first state in the question.
  - config2: Configuration of the second state in the question.
  - position: The number of the asked DCE question.
  - answer: Whether state 1 or 2 was preferred.
  - id: Randomized id of the participant.
- weight_final_model.csv:
  - config: 6 numbers representing the attribute levels.
  - weight: The weight calculated with the final model.
  - ciu: The upper 95% credible interval.
  - cil: The lower 95% credible interval.
- coefs_final_model.csv:
  - name: Name of the coefficient, composed of an abbreviation for the attribute and a level number (abbreviations in the same order as above: health, socrel, finhou, occu, secu, polciv).
  - value: Continuous, weight on the 0 to 1 scale.
  - ciu: The upper 95% credible interval.
  - cil: The lower 95% credible interval.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background and Purpose: About 20.1% of intracranial aneurysm (IA) carriers are patients with multiple intracranial aneurysms (MIAs), who have a higher rupture risk and worse prognosis. A prediction model may bring some potential benefits. This study attempted to develop and externally validate a dynamic nomogram to assess the rupture risk of each IA among patients with MIA.

Method: We retrospectively analyzed the data of 262 patients with 611 IAs admitted to the Hunan Provincial People's Hospital between November 2015 and November 2021. Multivariable logistic regression (MLR) was applied to select the risk factors and derive a nomogram model for the assessment of IA rupture risk in MIA patients. To externally validate the nomogram, data of 35 patients with 78 IAs were collected from another independent center between December 2009 and May 2021. The performance of the nomogram was assessed in terms of discrimination, calibration, and clinical utility.

Result: Size, location, irregular shape, diabetes history, and neck width were independently associated with IA rupture. The nomogram showed a good discriminative ability for ruptured and unruptured IAs in the derivation cohort (AUC = 0.81; 95% CI, 0.774–0.847) and was successfully generalized in the external validation cohort (AUC = 0.744; 95% CI, 0.627–0.862). The nomogram was well calibrated, and the decision curve analysis showed that it would generate more net benefit in identifying IA rupture than the "treat all" or "treat none" strategies at threshold probabilities ranging from 10 to 60% in both the derivation and external validation sets. The web-based dynamic nomogram calculator is accessible at https://wfs666.shinyapps.io/onlinecalculator/.

Conclusion: External validation showed that the model has the potential to assist clinical identification of dangerous aneurysms after longitudinal data evaluation. Size, neck width, and location are the primary risk factors for ruptured IAs.
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly. From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to collate responses from their unit themselves before reporting in the survey. Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival. To calculate per-person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.

Resources in this dataset:

Resource Title: Appendix A: ARS data storage survey questions.
File Name: Appendix A.pdf
Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF, but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here.
Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/

Resource Title: CSV of Responses from ARS Researcher Data Storage Survey.
File Name: Machine-readable survey response data.csv
Resource Description: CSV file that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).

Resource Title: Responses from ARS Researcher Data Storage Survey.
File Name: Data Storage Survey Data for public release.xlsx
Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed.
Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Use the Public Notification template to allow users to create a list of features that they can export to a CSV or PDF file. They can create lists of features by searching for a single location, drawing an area of interest to include intersecting features, or using the geometry of an existing feature as the area of interest. Users can include a search buffer around a defined area of interest to expand the list of features.

Examples: Export a CSV file with addresses for residents to alert about road closures. Create an inventory of contact information for parents in a school district. Generate a PDF file to print address labels and the corresponding map for community members.

Data requirements: The Public Notification template requires a feature layer to use all of its capabilities.

Key app capabilities:
Search radius - Define a distance for a search buffer that selects intersecting input features to include in the list.
Export - Save the results from the lists created in the app. Users can export the data to CSV and PDF format.
Refine selection - Allow users to revise the selected features in the lists they create by adding or removing features with sketch tools.
Sketch tools - Draw graphics on the map to select features to add to a list. Users can also use features from a layer to select intersecting features from the input layer.
Home, Zoom controls, Legend, Layer List, Search

Supportability: This web app is designed responsively to be used in browsers on desktops, mobile phones, and tablets. We are committed to ongoing efforts towards making our apps as accessible as possible. Please feel free to leave a comment on how we can improve the accessibility of our apps for those who use assistive technologies.