http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.
The key columns in the dataset are as follows:
In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.
By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
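As a minimal sketch of such an exploratory pass (the file name below is a placeholder, not part of the dataset):

```python
import pandas as pd

# Load the dataset; adjust the path/file name to your local copy.
df = pd.read_csv("house_prices.csv")

# Basic structure: roughly 9,819 rows and 54 columns are expected.
print(df.shape)
print(df.dtypes.value_counts())

# Descriptive statistics for numerical features and a quick missing-value check.
print(df.describe())
print(df.isna().sum().sort_values(ascending=False).head(10))
```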
This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.
https://crawlfeeds.com/privacy_policy
Provided in CSV format, this dataset offers comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.
Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.
With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The present study updates and extends the meta-analysis by Haus et al. (2013), who applied the theory of planned behavior (TPB) to analyze gender differences in the motivation to start a business. We extend this meta-analysis by investigating the moderating role of the societal context in which the motivation to start a business emerges and proceeds. The results, based on 119 studies analyzing 129 samples with 266,958 individuals from 36 countries, show smaller gender differences than the original study and reveal little difference across cultural regions in the effects of the tested model. A meta-regression analyzing the role of specific cultural dimensions and economic factors on gender-related correlations reveals significant effects only of gender egalitarianism, and in the direction opposite to that expected. In summary, the study contributes to the discussion on gender differences, the importance of study replications and updates of meta-analyses, and the generalizability of theories across cultural contexts. Dataset for: Steinmetz, H., Isidor, R., & Bauer, C. (2021). Gender Differences in the Intention to Start a Business. Zeitschrift Für Psychologie, 229(1), 70–84. https://doi.org/10.1027/2151-2604/a000435: Electronic supplementary material D - Data file
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the replication package for the paper titled "How Do Requirements Evolve During Elicitation? An Empirical Study Combining Interviews and App Store Analysis", by Alessio Ferrari, Paola Spoletini and Sourav Debnath.
The package contains the following folders and files.
/R-analysis
This is a folder containing all the R implementations of the statistical tests included in the paper, together with the source .csv files used to produce the results. Each R file has the same title as the associated .csv file. The titles of the files reflect the RQs as they appear in the paper. The association between R files and Tables in the paper is as follows:
- RQ1-1-analyse-story-rates.R: Table 1, user story rates
- RQ1-1-analyse-role-rates.R: Table 1, role rates
- RQ1-2-analyse-story-category-phase-1.R: Table 3, user story category rates in phase 1 compared to original rates
- RQ1-2-analyse-role-category-phase-1.R: Table 5, role category rates in phase 1 compared to original rates
- RQ2.1-analysis-app-store-rates-phase-2.R: Table 8, user story and role rates in phase 2
- RQ2.2-analysis-percent-three-CAT-groups-ph1-ph2.R: Table 9, comparison of the categories of user stories in phase 1 and 2
- RQ2.2-analysis-percent-two-CAT-roles-ph1-ph2.R: Table 10, comparison of the categories of roles in phase 1 and 2.
The .csv files used for the statistical tests are also used to produce boxplots. The association between boxplot figures and files is as follows.
- RQ1-1-story-rates.csv: Figure 4
- RQ1-1-role-rates.csv: Figure 5
- RQ1-2-categories-phase-1.csv: Figure 8
- RQ1-2-role-category-phase-1.csv: Figure 9
- RQ2-1-user-story-and-roles-phase-2.csv: Figure 13
- RQ2.2-percent-three-CAT-groups-ph1-ph2.csv: Figure 14
- RQ2.2-percent-two-CAT-roles-ph1-ph2.csv: Figure 17
- IMG-only-RQ2.2-us-category-comparison-ph1-ph2.csv: Figure 15
- IMG-only-RQ2.2-frequent-roles.csv: Figure 18
NOTE: The last two .csv files do not have associated statistical tests; they are used solely to produce boxplots.
/Data-Analysis
This folder contains all the data used to answer the research questions.
RQ1.xlsx: includes all the data associated with the RQ1 subquestions, with two tabs for each subquestion (one for user stories and one for roles). The names of the tabs are self-explanatory.
RQ2.1.xlsx: includes all the data for the RQ2.1 subquestion. Specifically, it includes the following tabs:
- Data Source-US-category: for each category of user story and for each analyst, there are two lines. The first reports the number of user stories in that category for phase 1, and the second reports the number of user stories in that category for phase 2, for the specific analyst.
- Data Source-role: for each category of role and for each analyst, there are two lines. The first reports the number of user stories with that role for phase 1, and the second reports the number of user stories with that role for phase 2, for the specific analyst.
- RQ2.1 rates: reports the final rates for RQ2.1.
NOTE: The other tabs are used to support the computation of the final rates.
RQ2.2.xlsx: includes all the data for the RQ2.2 subquestion. Specifically, it includes the following tabs:
- Data Source-US-category: same as RQ2.1.xlsx
- Data Source-role: same as RQ2.1.xlsx
- RQ2.2-category-group: comparison between groups of categories in the different phases, used to produce Figure 14
- RQ2.2-role-group: comparison between role groups in the different phases, used to produce Figure 17
- RQ2.2-specific-roles-diff: difference between specific roles, used to produce Figure 18
NOTE: the other tabs are used to support the computation of the values reported in the tabs above.
RQ2.2-single-US-category.xlsx: includes the data for the RQ2.2 subquestion associated with single categories of user stories. A separate file is used given the complexity of the computations.
- Data Source-US-category: same as RQ2.1.xlsx
- Totals: total number of user stories for each analyst in phase 1 and phase 2
- Results-Rate-Comparison: difference between rates of user stories in phase 1 and phase 2, used to produce the file
"img/IMG-only-RQ2.2-us-category-comparison-ph1-ph2.csv", which is in turn used to produce Figure 15
- Results-Analysts: number of analysts using each novel category produced in phase 2, used to produce Figure 16.
NOTE: the other tabs are used to support the computation of the values reported in the tabs above.
RQ2.3.xlsx: includes the data for the RQ2.3 subquestion. Specifically, it includes the following tabs:
- Data Source-US-category: same as RQ2.1.xlsx
- Data Source-role: same as RQ2.1.xlsx
- RQ2.3-categories: novel categories produced in phase 2, used to produce Figure 19
- RQ2-3-most-frequent-categories: most frequent novel categories
/Raw-Data-Phase-I
The folder contains one Excel file for each analyst, s1.xlsx...s30.xlsx, plus the file of the original user stories with annotations (original-us.xlsx). Each file contains two tabs:
- Evaluation: includes the annotation of the user stories as existing user stories in the original categories (annotated with "E"), novel user stories in a certain category (refinement, annotated with "N"), or novel user stories in a novel category (name of the category in column "New Feature"). **NOTE 1:** In the paper, the "refinement" case is said to be annotated with "R" (instead of "N", as in the files) to make the paper clearer and easier to read.
- Roles: roles used in the user stories, and count of the user stories belonging to a certain role.
/Raw-Data-Phase-II
The folder contains one Excel file for each analyst, s1.xlsx...s30.xlsx. Each file contains two tabs:
- Analysis: includes the annotation of the user stories as belonging to an existing original category (X), or to categories introduced after interviews, or to categories introduced after app store inspired elicitation (name of category in "Cat. Created in PH1"), or to entirely novel categories (name of category in "New Category").
- Roles: roles used in the user stories, and count of the user stories belonging to a certain role.
/Figures
This folder includes the figures reported in the paper. The boxplots are generated from the data using the tool http://shiny.chemgrid.org/boxplotr/. The histograms and other plots are produced with Excel, and are also reported in the Excel files listed above.
https://crawlfeeds.com/privacy_policy
We are excited to announce that we have successfully extracted a comprehensive set of alcoholic beverage records from BevMo and compiled them into a CSV file.
This meticulously organized dataset includes key information such as product URLs, IDs, names, SKUs, GTIN14 barcodes, detailed product descriptions, availability status, pricing, currency, images, breadcrumbs, and more.
Our dataset provides an invaluable resource for anyone looking to analyze or utilize detailed BevMo product information.
Download the dataset today and gain access to a wealth of information from one of the leading beverage retailers.
Perfect for market analysis, e-commerce insights, and competitive research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Canada Trademarks Dataset
18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303
Dataset Selection and Arrangement (c) 2021 Jeremy Sheff
Python and Stata Scripts (c) 2021 Jeremy Sheff
Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.
This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.
Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.
Terms of Use:
As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.
The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:
The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.
Details of Repository Contents:
This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:
If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.
The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.
The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.
This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.
## Contents of the Replication Package

---

- Appendix.pdf - appendix of the paper containing supplementary tables
- Tags-topics.md - tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2)
- Stack-exchange-query.md - the query used to extract the posts from the Stack Exchange explorer
- RQ1/ - contains the data used to answer RQ1
  - LDA_input/ - input data used for the LDA analysis
    - combined-so-quora-mallet-metadata.csv - Stack Overflow and Quora questions used to perform the LDA analysis
    - topic-input.mallet - input file for the MALLET tool
  - LDA_output/
    - Mallet/ - contains the LDA output generated by the MALLET tool
    - output_csv/
      - docs-in-topics.csv - documents per topic
      - topic-words.csv - most relevant topic words
      - topics-in-docs.csv - topic probability per document
      - topics-metadata.csv - metadata per document and topic probability
    - output_html/ - browsable results of the MALLET output
      - all_topics.html
      - Docs/
      - Topics/
- RQ2/ - contains the data used to answer RQ2
  - datasource_rawdata/ - contains the raw data for each source
    - quora.csv - the processed Quora dataset (e.g., with HTML tags removed). For more about the preprocessing steps, please refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
    - stackoverflow.csv - the processed Stack Overflow dataset. For more about the preprocessing steps, please refer to the reproducibility section in the paper. The data is preprocessed using the Makar tool.
  - manual_analysis_output/
    - stackoverflow_quora_taxonomy.xlsx - the classified Stack Overflow and Quora dataset and the description of the taxonomy
      - Taxonomy - describes the first dimension and second dimension categories. Second dimension categories are further divided into levels, separated by the | symbol.
      - stackoverflow-posts - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories
      - quora-posts - the questions are labelled relevant or irrelevant and categorized into the first dimension and second dimension categories

---

https://creativecommons.org/publicdomain/zero/1.0/
The CSV file contains aggregated data on the results of the experiment: user ID (user_id), treatment type (group), and key user metrics (views and clicks). The task is to analyze the results of the experiment and write your recommendations.
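As a minimal sketch of how such an analysis could start (the file name is a placeholder; the column names follow the description above):

```python
import pandas as pd

# Load the experiment results; adjust the file name to your copy of the data.
df = pd.read_csv("experiment_results.csv")

# Aggregate views and clicks per treatment group and compute the click-through rate.
summary = df.groupby("group").agg(
    users=("user_id", "nunique"),
    views=("views", "sum"),
    clicks=("clicks", "sum"),
)
summary["ctr"] = summary["clicks"] / summary["views"]
print(summary)
```

From there, a significance test on the group-level click-through rates would support the final recommendations.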
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset artifact contains the intermediate datasets from pipeline executions necessary to reproduce the results of the paper. We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate these datasets and analyse them are shared in the GitHub repository for this work.
The dataset contains large files of approximately 60 GB, so please exercise caution when extracting the data from the compressed files.
The dataset contains files which could take a significant amount of script run time to generate/reproduce.
Dataset Contents
We briefly summarize the included files in our dataset. Please refer to the documentation for specific information about the structure of the data in these files, the scripts to generate them, and runtimes for various parts of our data processing pipeline.
epoch_9_loss_0.04706_testAcc_0.96867_X_resnext101_docSeg.pth: We share this model file, originally provided by Jobin et al., to enable the classification of figures found in our dataset. Please place this into the model/ directory.
model-results.csv: This file contains results from the classification performed on the figures found in the notebooks in our dataset.
Performing this classification may take up to a day.
a11y-scan-dataset.zip: This archive contains two files and results in datasets of approximately 60GB when extracted. Please ensure that you have sufficient disk space to uncompress this zip archive. The archive contains:
a11y/a11y-detailed-result.csv: This dataset contains the accessibility scan results from the scans run on the 100k notebooks across themes.
The detailed result file can be really large (> 60 GB) and can be time-consuming to construct.
a11y/a11y-aggregate-scan.csv: This file is an aggregate of the detailed result that contains the number of each type of error found in each notebook.
This file is also shared outside the compressed directory.
errors-different-counts-a11y-analyze-errors-summary.csv: This file contains the counts of errors that occur in notebooks across different themes.
nb_processed_cell_html.csv: This file contains metadata corresponding to each cell extracted from the HTML exports of our notebooks.
nb_first_interactive_cell.csv: This file contains the necessary metadata to compute the first interactive element, as defined in our paper, in each notebook.
nb_processed.csv: This file contains the data produced after processing the notebooks, extracting the number of images, imports, languages, and cell-level information.
processed_function_calls.csv: This file contains information about the notebooks and the various imports and function calls used within them.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, and in terms of the students' emotional response. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires used to assess the emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.
Dataset
The artifact contains the resources described below.
Experiment resources
The resources needed for replicating the experiment, namely in directory experiment:
alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was provided in Portuguese due to the population of the experiment.
alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.
docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.
api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.
Experiment data
The task database used in our application of the experiment, namely in directory data/experiment:
Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.
identifiers.txt: the list of all (104) available identifiers for participants in the experiment.
Collected data
Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the form of JSON and CSV files with a header row, namely in directory data/results:
data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).
data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:
participant identification: participant's unique identifier (ID);
socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20; NA denotes a preference not to disclose).
data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);
detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.
data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID);
user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).
participants.txt: the list of participant identifiers that have registered for the experiment.
Analysis scripts
The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:
analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.
requirements.r: An R script to install the required libraries for the analysis script.
normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.
normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.
Dockerfile: Docker script to automate the analysis script from the collected data.
Setup
To replicate the experiment and the analysis of the results, only Docker is required.
If you wish to manually replicate the experiment and collect your own data, you'll need to install:
A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.
If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:
Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.
R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.
Usage
Experiment replication
This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.
To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.
cd experiment
docker-compose up
This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.
In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:
Group N (no hints): http://localhost:3000/0CAN
Group L (error locations): http://localhost:3000/CA0L
Group E (counter-example): http://localhost:3000/350E
Group D (error description): http://localhost:3000/27AD
In the 2nd session, as in the 1st, each permalink gave access to 12 sequential tasks, and the next task became available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
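A small sketch of the permalink scheme described above (a hypothetical helper, not part of the artifact):

```python
# Permalink scheme: session 1 uses the raw identifier, session 2 prepends "P-",
# and the treatment group is the identifier's last character (N, L, E or D).
BASE_URL = "http://localhost:3000"

def permalinks(identifier: str) -> dict:
    return {
        "group": identifier[-1],
        "session_1": f"{BASE_URL}/{identifier}",
        "session_2": f"{BASE_URL}/P-{identifier}",
    }

print(permalinks("0CAN"))
# {'group': 'N', 'session_1': 'http://localhost:3000/0CAN', 'session_2': 'http://localhost:3000/P-0CAN'}
```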
Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.
Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.
After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:
Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.
Analysis of other applications of the experiment
This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.
The analysis script expects data in 4 CSV files,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
df_force_kin_filtered.csv is the data sheet used by the DATA3 Python notebook to analyse kinematics and dynamics combined. It contains the footfalls that have data for both kinematics and dynamics. To see how this file is generated, read the first half of the Jupyter notebook.
https://crawlfeeds.com/privacy_policy
The Dog Food Data Extracted from Chewy (USA) dataset contains 4,500 detailed records of dog food products sourced from one of the leading pet supply platforms in the United States, Chewy. This dataset is ideal for businesses, researchers, and data analysts who want to explore and analyze the dog food market, including product offerings, pricing strategies, brand diversity, and customer preferences within the USA.
The dataset includes essential information such as product names, brands, prices, ingredient details, product descriptions, weight options, and availability. Organized in a CSV format for easy integration into analytics tools, this dataset provides valuable insights for those looking to study the pet food market, develop marketing strategies, or train machine learning models.
Key Features:
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. The given dataset contains a retailer's transaction data: it records all the transactions that have happened over a period of time. The retailer will use the results to grow its business and to provide customers with itemset suggestions, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.
Association rule mining is most useful when you want to build associations between different objects in a set, and it works well when you are looking for frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
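The same numbers, computed explicitly (a small Python sketch of the definitions above):

```python
# 100 customers: 10 bought a computer mouse, 9 bought a mouse mat, 8 bought both.
n = 100
mouse, mat, both = 10, 9, 8

support = both / n                   # P(mouse and mat) = 0.08
confidence = support / (mouse / n)   # 0.08 / 0.10 = 0.80
lift = confidence / (mat / n)        # 0.80 / 0.09 ≈ 8.9

print(support, confidence, round(lift, 1))
```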
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. I briefly describe each library below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
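The walkthrough above performs this step in R. As a rough illustration of the same idea in Python (the column names and values are hypothetical, not the actual Assignment-1_Data.xlsx schema), invoice-level rows can be pivoted into a one-hot transaction matrix:

```python
import pandas as pd

# Toy invoice-level data: one row per item purchased, grouped by invoice.
df = pd.DataFrame({
    "Invoice": [1, 1, 2, 2, 2, 3],
    "Item": ["bread", "milk", "bread", "butter", "milk", "milk"],
})

# One row per invoice, one Boolean column per item: the "transaction" format
# expected by association rule mining implementations.
baskets = (
    df.assign(bought=1)
      .pivot_table(index="Invoice", columns="Item", values="bought",
                   aggfunc="max", fill_value=0)
      .astype(bool)
)
print(baskets)
```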
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Crypto Price Monitoring Repository
This repository contains two CSV data files that were created to support the research titled "Price Arbitrage for DeFi Derivatives." This research is to be presented at the IEEE International Conference on Blockchain and Cryptocurrencies, taking place on 5th May 2023 in Dubai, UAE. The data files include monitoring prices for various crypto assets from several sources. The data files are structured with five columns, providing information about the symbol, unified symbol, time, price, and source of the price.
## Data Files
There are two CSV data files in this repository (one for each date):
1. `Pricemon_results_2023_01_13.csv`
2. `Pricemon_results_2023_01_14.csv`
## Data Format
Both data files have the same format and structure, with the following five columns:
1. `symbol`: The trading symbol for the crypto asset (e.g., BTC, ETH).
2. `unified_symbol`: A standardized symbol used across different platforms.
3. `time`: Timestamp for when the price data was recorded (in UTC format).
4. `price`: The price of the crypto asset at the given time (in USD).
5. `source`: The name of the price source for the data.
## Price Sources
The `source` column in the data files refers to the provider of the price data for each record. The sources include:
- `chainlink`: Chainlink Price Oracle
- `mycellium`: Built-in oracle of the Mycellium platform
- `bitfinex`: Bitfinex cryptocurrency exchange
- `ftx`: FTX cryptocurrency exchange
- `binance`: Binance cryptocurrency exchange
## Usage
You can use these data files for various purposes, such as analyzing price discrepancies across different sources, identifying trends, or developing trading algorithms. To use the data, simply import the CSV files into your preferred data processing or analysis tool.
### Example
Here's an example of how you can read and display the data using Python and the pandas library:
```python
import pandas as pd

# Read the data from CSV file
data = pd.read_csv('Pricemon_results_2023_01_13.csv')

# Display the first 5 rows of the data
print(data.head())
```
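Building on this, one way to inspect price discrepancies across sources (a sketch using the column names documented above) is to pivot the data so each source becomes a column:

```python
import pandas as pd

data = pd.read_csv('Pricemon_results_2023_01_13.csv')

# One column per price source, indexed by asset and timestamp,
# making cross-source differences easy to inspect.
by_source = data.pivot_table(index=["unified_symbol", "time"],
                             columns="source", values="price")
print(by_source.head())
```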
## Acknowledgements
These datasets were recorded and supported by the Datamint company (a value-added on-chain data provider) and its team.
## Contributing
If you have any suggestions or find any issues with the data, please feel free to contact the authors.
https://crawlfeeds.com/privacy_policy
The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.
This large dataset is ideal for:
Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.
The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"
## Root directory
- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.
## Dataset
- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv` contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads
## RQ1
- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It produces its output on stdout; redirect it to a file to be analyzed by the `RQ2/countDataset.py` script
- `RQ1/RQ1_countDataset.py`: given the output of `RQ2/analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ2/analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ2/countDataset.py`
## RQ2
- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents bias
- `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
## RQ3
- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, among other fields it indicates the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other pieces of info, whether the model has a license and, if yes, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level
## scripts
Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README
https://crawlfeeds.com/privacy_policy
Unlock the power of online marketplace analytics with our comprehensive eBay products dataset. This premium collection contains 1.29 million products from eBay's global marketplace, providing extensive insights into one of the world's largest e-commerce platforms. Perfect for competitive analysis, pricing strategies, market research, and machine learning applications in e-commerce.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the images and associated RSML files used in the paper entitled "Using a structural root system model to evaluate and improve the accuracy of root image analysis pipelines" by the same authors.
It contains:
- 30,000 simulated root images of 10,000 different root systems, with different noise levels (0=null, 1=medium, 3=high);
- 10,000 corresponding RSML files;
- .csv files containing the ground-truth data for each modelled root system (500-data.csv);
- .csv files containing the image descriptors extracted using RIA-J (500-descriptors.csv);
The code used to generate and analyse this dataset is available here: http://doi.org/10.5281/zenodo.208499
Note about the metrics computed from the RSML files (contained in 500-data.csv): the root systems were simulated in a constrained 2D space (rhizotron-like). Therefore, the RSML_reader plugin computed the lengths of the different roots in 2D only (despite the fact that a Z coordinate is present in the RSML files).
CORRECTION: There is a scale issue in the 500-descriptors.csv file: the values in the "area" and "convexhull" columns should be divided by 10.
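A minimal sketch of applying this correction in Python, assuming the file is read as a regular CSV with those column names (the output file name is just an example):

```python
import pandas as pd

# Apply the documented scale correction to the descriptor file:
# "area" and "convexhull" values should be divided by 10.
descriptors = pd.read_csv("500-descriptors.csv")
descriptors[["area", "convexhull"]] = descriptors[["area", "convexhull"]] / 10
descriptors.to_csv("500-descriptors-corrected.csv", index=False)
```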
https://crawlfeeds.com/privacy_policy
Unlock the power of data with our comprehensive Grainger Products Dataset, featuring over 220,000 meticulously curated records in CSV format. This dataset is an invaluable resource for businesses, researchers, and data scientists looking to optimize their operations, conduct market analysis, or enhance their machine learning models.
Each record in the dataset includes critical fields such as URL, title, brand, SKU, price, pricing unit, product model, product ID, product UNSPSC, breadcrumbs, images, specifications, compliance and restrictions, description, unique ID, and the scraped date. Whether you're analyzing product trends, comparing prices, or developing e-commerce solutions, this dataset provides the depth and breadth of information you need.
Submit your custom requests at the Grainger products page.
Example Use Cases:
Start leveraging this data to make informed decisions and gain a competitive edge in your industry.