100+ datasets found
  1. Real State Website Data

    • kaggle.com
    Updated Jun 11, 2023
    Cite
    M. Mazhar (2023). Real State Website Data [Dataset]. https://www.kaggle.com/datasets/mazhar01/real-state-website-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M. Mazhar
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Check: End-to-End Regression Model Pipeline Development with FastAPI: From Data Scraping to Deployment with CI/CD Integration

    This CSV dataset provides comprehensive information about house prices. It consists of 9,819 entries and 54 columns, offering a wealth of features for analysis. The dataset includes various numerical and categorical variables, providing insights into factors that influence house prices.

    The key columns in the dataset are as follows:

    • Location1: The location of the house. (Location2 is an identical or shortened version of Location1.)
    • Year: The year of construction.
    • Type: The type of the house.
    • Bedrooms: The number of bedrooms in the house.
    • Bathrooms: The number of bathrooms in the house.
    • Size_in_SqYds: The size of the house in square yards.
    • Price: The price of the house.
    • Parking_Spaces: The number of parking spaces available.
    • Floors_in_Building: The number of floors in the building.
    • Elevators: The presence of elevators in the building.
    • Lobby_in_Building: The presence of a lobby in the building.

    In addition to these, the dataset contains several other features related to various amenities and facilities available in the houses, such as double-glazed windows, central air conditioning, central heating, waste disposal, furnished status, service elevators, and more.

    By performing exploratory data analysis on this dataset using Python and the Pandas library, valuable insights can be gained regarding the relationships between different variables and the impact they have on house prices. Descriptive statistics, data visualization, and feature engineering techniques can be applied to uncover patterns and trends in the housing market.
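
    The paragraph above maps directly onto a few lines of pandas. The sketch below is a minimal starting point, not part of the dataset itself: the file name is hypothetical, the column names come from the list above, and Price is assumed to be numeric (coerce it first if it is stored as text).

    import pandas as pd

    # Hypothetical file name; use the CSV shipped with the Kaggle dataset
    df = pd.read_csv("real_state_website_data.csv")

    # The description reports 9,819 rows and 54 columns
    print(df.shape)

    # Descriptive statistics for the core numeric features
    print(df[["Bedrooms", "Bathrooms", "Size_in_SqYds", "Price"]].describe())

    # Median price by house type and by location
    print(df.groupby("Type")["Price"].median().sort_values(ascending=False))
    print(df.groupby("Location1")["Price"].median().nlargest(10))

    # How strongly does size track price?
    print(df[["Size_in_SqYds", "Price"]].corr())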

    This dataset serves as a valuable resource for real estate professionals, analysts, and researchers interested in understanding the factors that contribute to house prices and making informed decisions in the real estate market.

  2. Mecca Australia Extracted Data in CSV Format

    • crawlfeeds.com
    csv, zip
    Updated Sep 2, 2024
    Cite
    Crawl Feeds (2024). Mecca Australia Extracted Data in CSV Format [Dataset]. https://crawlfeeds.com/datasets/mecca-australia-extracted-data-in-csv-format
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Sep 2, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    This dataset provides comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.

    Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.

    Features:

    • Product Information: Detailed data on various beauty products, including product names, categories, and brands.
    • Pricing Data: Up-to-date pricing details for each product, allowing for competitive analysis and pricing strategy development.
    • Product Descriptions: Comprehensive descriptions that provide insight into product features and benefits.
    • Stock Availability: Information on stock status to help track product availability and manage inventory.
    • CSV Format: Easy-to-use CSV file format for seamless integration into any data analysis or business intelligence tool.

    Applications:

    • Market Analysis: Gain insights into the beauty market trends in Australia by analyzing product categories, brands, and pricing.
    • Competitor Research: Compare product offerings and pricing strategies to understand the competitive landscape.
    • Inventory Management: Use stock availability data to optimize inventory and ensure popular items are always in stock.
    • Product Development: Leverage product descriptions to identify gaps in the market and innovate new product offerings.

    With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.

  3. Cultural Clusters Data.csv

    • psycharchives.org
    Updated Dec 17, 2020
    + more versions
    Cite
    (2020). Cultural Clusters Data.csv [Dataset]. https://www.psycharchives.org/en/item/dacf9eae-a8fb-466c-b222-eee51fb24a27
    Explore at:
    Dataset updated
    Dec 17, 2020
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The present study updates and extends the meta-analysis by Haus et al. (2013), who applied the theory of planned behavior (TPB) to analyze gender differences in the motivation to start a business. We extend this meta-analysis by investigating the moderating role of the societal context in which the motivation to start a business emerges and proceeds. The results, based on 119 studies analyzing 129 samples with 266,958 individuals from 36 countries, show smaller gender differences than the original study and reveal little difference across cultural regions in the effects of the tested model. A meta-regression analyzing the role of specific cultural dimensions and economic factors on gender-related correlations reveals a significant effect only for gender egalitarianism, and in the opposite direction than expected. In summary, the study contributes to the discussion on gender differences, the importance of study replications and updates of meta-analyses, and the generalizability of theories across cultural contexts. Dataset for: Steinmetz, H., Isidor, R., & Bauer, C. (2021). Gender Differences in the Intention to Start a Business. Zeitschrift für Psychologie, 229(1), 70–84. https://doi.org/10.1027/2151-2604/a000435: Electronic supplementary material D - Data file

  4. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • figshare.com
    Updated Aug 1, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.6084/m9.figshare.20416092.v1
    Explore at:
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elizabeth Szkirpan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python scripts for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results, and by library type clusters. To streamline data analysis in certain locations, an offshoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type; clustered files available as part of the Dataverse for this project).
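
    As an illustration of the holistic-versus-clustered split described above, here is a minimal pandas sketch; the file name and the library_type column are assumptions, since the actual schema lives in the project's Dataverse files.

    import pandas as pd

    # Hypothetical file and column names; adjust to the cleaned survey file
    survey = pd.read_csv("covid_impact_cleaned.csv")

    # Holistic analysis: summarize every question across all responses
    print(survey.describe(include="all"))

    # Clustered analysis: the same summary broken out by library type
    for lib_type, group in survey.groupby("library_type"):
        print(f"--- {lib_type} ({len(group)} responses) ---")
        print(group.describe(include="all"))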

  5. Replication Package - How Do Requirements Evolve During Elicitation? An...

    • zenodo.org
    bin, zip
    Updated Apr 21, 2022
    Cite
    Alessio Ferrari; Paola Spoletini; Sourav Debnath (2022). Replication Package - How Do Requirements Evolve During Elicitation? An Empirical Study Combining Interviews and App Store Analysis [Dataset]. http://doi.org/10.5281/zenodo.6472498
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Apr 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Ferrari; Paola Spoletini; Sourav Debnath
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for the paper titled "How Do Requirements Evolve During Elicitation? An Empirical Study Combining Interviews and App Store Analysis", by Alessio Ferrari, Paola Spoletini and Sourav Debnath.

    The package contains the following folders and files.

    /R-analysis

    This is a folder containing all the R implementations of the statistical tests included in the paper, together with the source .csv files used to produce the results. Each R file has the same title as the associated .csv file. The titles of the files reflect the RQs as they appear in the paper. The association between R files and tables in the paper is as follows:

    - RQ1-1-analyse-story-rates.R: Table 1, user story rates

    - RQ1-1-analyse-role-rates.R: Table 1, role rates

    - RQ1-2-analyse-story-category-phase-1.R: Table 3, user story category rates in phase 1 compared to original rates

    - RQ1-2-analyse-role-category-phase-1.R: Table 5, role category rates in phase 1 compared to original rates

    - RQ2.1-analysis-app-store-rates-phase-2.R: Table 8, user story and role rates in phase 2

    - RQ2.2-analysis-percent-three-CAT-groups-ph1-ph2.R: Table 9, comparison of the categories of user stories in phase 1 and 2

    - RQ2.2-analysis-percent-two-CAT-roles-ph1-ph2.R: Table 10, comparison of the categories of roles in phase 1 and 2.

    The .csv files used for the statistical tests are also used to produce boxplots. The association between boxplot figures and files is as follows.

    - RQ1-1-story-rates.csv: Figure 4

    - RQ1-1-role-rates.csv: Figure 5

    - RQ1-2-categories-phase-1.csv: Figure 8

    - RQ1-2-role-category-phase-1.csv: Figure 9

    - RQ2-1-user-story-and-roles-phase-2.csv: Figure 13

    - RQ2.2-percent-three-CAT-groups-ph1-ph2.csv: Figure 14

    - RQ2.2-percent-two-CAT-roles-ph1-ph2.csv: Figure 17

    - IMG-only-RQ2.2-us-category-comparison-ph1-ph2.csv: Figure 15

    - IMG-only-RQ2.2-frequent-roles.csv: Figure 18

    NOTE: The last two .csv files do not have associated statistical tests; they are used solely to produce boxplots.
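
    The boxplots themselves were produced with the BoxPlotR web tool (see /Figures below), but similar figures can be drawn from the .csv files directly. A minimal pandas/matplotlib sketch, assuming each rates file holds one column of values per group:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumption: one column per group, as BoxPlotR-style tools expect
    rates = pd.read_csv("RQ1-1-story-rates.csv")

    rates.plot(kind="box")
    plt.ylabel("user story rate")
    plt.title("RQ1.1 user story rates (cf. Figure 4)")
    plt.tight_layout()
    plt.savefig("rq1-1-story-rates-boxplot.png")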

    /Data-Analysis

    This folder contains all the data used to answer the research questions.

    RQ1.xlsx: includes all the data associated with the RQ1 subquestions, with two tabs for each subquestion (one for user stories and one for roles). The tab names are self-explanatory.

    RQ2.1.xlsx: includes all the data for the RQ2.1 subquestion. Specifically, it includes the following tabs:

    - Data Source-US-category: for each category of user story, and for each analyst, there are two lines. The first one reports the number of user stories in that category for phase 1, and the second one reports the number of user stories in that category for phase 2, considering the specific analyst.

    - Data Source-role: for each category of role, and for each analyst, there are two lines. The first one reports the number of user stories in that role for phase 1, and the second one reports the number of user stories in that role for phase 2, considering the specific analyst.

    - RQ2.1 rates: reports the final rates for RQ2.1.

    NOTE: The other tabs are used to support the computation of the final rates.

    RQ2.2.xlsx: includes all the data for the RQ2.2 subquestion. Specifically, it includes the following tabs:

    - Data Source-US-category: same as RQ2.1.xlsx

    - Data Source-role: same as RQ2.1.xlsx

    - RQ2.2-category-group: comparison between groups of categories in the different phases, used to produce Figure 14

    - RQ2.2-role-group: comparison between role groups in the different phases, used to produce Figure 17

    - RQ2.2-specific-roles-diff: difference between specific roles, used to produce Figure 18

    NOTE: the other tabs are used to support the computation of the values reported in the tabs above.

    RQ2.2-single-US-category.xlsx: includes the data for the RQ2.2 subquestion associated with single categories of user stories. A separate file is used given the complexity of the computations.

    - Data Source-US-category: same as RQ2.1.xlsx

    - Totals: total number of user stories for each analyst in phase 1 and phase 2

    - Results-Rate-Comparison: difference between rates of user stories in phase 1 and phase 2, used to produce the file "img/IMG-only-RQ2.2-us-category-comparison-ph1-ph2.csv", which is in turn used to produce Figure 15

    - Results-Analysts: number of analysts using each novel category produced in phase 2, used to produce Figure 16.

    NOTE: the other tabs are used to support the computation of the values reported in the tabs above.

    RQ2.3.xlsx: includes the data for the RQ2.3 subquestion. Specifically, it includes the following tabs:

    - Data Source-US-category: same as RQ2.1.xlsx

    - Data Source-role: same as RQ2.1.xlsx

    - RQ2.3-categories: novel categories produced in phase 2, used to produce Figure 19

    - RQ2-3-most-frequent-categories: most frequent novel categories

    /Raw-Data-Phase-I

    The folder contains one Excel file for each analyst, s1.xlsx...s30.xlsx, plus the file of the original user stories with annotations (original-us.xlsx). Each file contains two tabs:

    - Evaluation: includes the annotation of the user stories as existing user story in the original categories (annotated with "E"), novel user story in a certain category (refinement, annotated with "N"), and novel user story in novel category (Name of the category in column "New Feature"). **NOTE 1:** It should be noticed that in the paper the case "refinement" is said to be annotated with "R" (instead of "N", as in the files) to make the paper clearer and easy to read.

    - Roles: roles used in the user stories, and count of the user stories belonging to a certain role.

    /Raw-Data-Phase-II

    The folder contains one Excel file for each analyst, s1.xlsx...s30.xlsx. Each file contains two tabs:

    - Analysis: includes the annotation of the user stories as belonging to an existing original category (X), or to categories introduced after interviews, or to categories introduced after app-store-inspired elicitation (name of category in "Cat. Created in PH1"), or to entirely novel categories (name of category in "New Category").

    - Roles: roles used in the user stories, and count of the user stories belonging to a certain role.

    /Figures

    This folder includes the figures reported in the paper. The boxplots are generated from the data using the tool http://shiny.chemgrid.org/boxplotr/. The histograms and other plots are produced with Excel, and are also reported in the Excel files listed above.

  6. BevMo Alcoholic Beverage Records Extracted - Download Comprehensive CSV...

    • crawlfeeds.com
    csv, zip
    Updated Sep 7, 2024
    Cite
    Crawl Feeds (2024). BevMo Alcoholic Beverage Records Extracted - Download Comprehensive CSV Dataset Now [Dataset]. https://crawlfeeds.com/datasets/bevmo-alcoholic-beverage-records-extracted-download-comprehensive-csv-dataset-now
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Sep 7, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    We are excited to announce that we have successfully extracted a comprehensive set of alcoholic beverage records from BevMo and compiled them into a CSV file.

    This meticulously organized dataset includes key information such as product URLs, IDs, names, SKUs, GTIN14 barcodes, detailed product descriptions, availability status, pricing, currency, images, breadcrumbs, and more.

    Our dataset provides an invaluable resource for anyone looking to analyze or utilize detailed BevMo product information.

    Download the dataset today and gain access to a wealth of information from one of the leading beverage retailers.

    Perfect for market analysis, e-commerce insights, and competitive research.

  7. The Canada Trademarks Dataset

    • zenodo.org
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jeremy Sheff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Canada
    Description

    The Canada Trademarks Dataset

    18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

    Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

    Python and Stata Scripts (c) 2021 Jeremy Sheff

    Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

    This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

    Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

    Terms of Use:

    As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

    The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

    The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

    Details of Repository Contents:

    This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

    • /csv: contains the .csv versions of the data files
    • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
    • /dta: contains the .dta versions of the data files
    • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

    If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

    The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
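
    For readers unfamiliar with the streaming pattern these scripts rely on, the sketch below illustrates the general iterparse approach using Python's standard library. It is illustrative only: the tag and field names are hypothetical, and the real /py scripts handle CIPO's actual XML schema.

    import csv
    import xml.etree.ElementTree as ET

    with open("trademarks.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["application_number", "filing_date"])
        # iterparse streams the XML, so multi-gigabyte archives never
        # have to be loaded into memory in full
        for event, elem in ET.iterparse("cipo_trademarks.xml", events=("end",)):
            if elem.tag == "TradeMark":  # hypothetical element name
                writer.writerow([
                    elem.findtext("ApplicationNumber"),  # hypothetical fields
                    elem.findtext("FilingDate"),
                ])
                elem.clear()  # free the element once it has been written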

    With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

    The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

    This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

  8. Replication package for the paper "What do Developers Discuss about Code...

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1 more
    Updated Jun 30, 2021
    Cite
    (2021). Replication package for the paper "What do Developers Discuss about Code Comments" [Dataset]. http://doi.org/10.5281/zenodo.5044270
    Explore at:
    Dataset updated
    Jun 30, 2021
    Description

    RP-commenting-practices-multiple-sources: replication package for the paper "What do Developers Discuss about Code Comments?"

    ## Structure

    - Appendix.pdf
    - Tags-topics.md
    - Stack-exchange-query.md
    - RQ1/ (LDA_input/, LDA_output/)
    - RQ2/ (datasource_rawdata/, manual_analysis_output/)

    ## Contents of the Replication Package

    - Appendix.pdf: appendix of the paper containing supplementary tables.
    - Tags-topics.md: tags selected from Stack Overflow and topics selected from Quora for the study (RQ1 & RQ2).
    - Stack-exchange-query.md: the query interface used to extract the posts from the Stack Exchange explorer.
    - RQ1/: the data used to answer RQ1.
      - LDA_input/: input data used for the LDA analysis.
        - combined-so-quora-mallet-metadata.csv: Stack Overflow and Quora questions used to perform the LDA analysis.
        - topic-input.mallet: input file for the MALLET tool.
      - LDA_output/
        - Mallet/: the LDA output generated by the MALLET tool.
        - output_csv/
          - docs-in-topics.csv: documents per topic.
          - topic-words.csv: most relevant topic words.
          - topics-in-docs.csv: topic probability per document.
          - topics-metadata.csv: metadata per document and topic probability.
        - output_html/: browsable results of the MALLET output (all_topics.html, Docs/, Topics/).
    - RQ2/: the data used to answer RQ2.
      - datasource_rawdata/: the raw data for each source.
        - quora.csv: the processed Quora dataset (e.g., HTML tags removed). For the preprocessing steps, see the reproducibility section in the paper; the data is preprocessed using the Makar tool.
        - stackoverflow.csv: the processed Stack Overflow dataset, preprocessed in the same way using the Makar tool.
      - manual_analysis_output/
        - stackoverflow_quora_taxonomy.xlsx: the classified Stack Overflow and Quora dataset and a description of the taxonomy.
          - Taxonomy: description of the first-dimension and second-dimension categories; second-dimension categories are further divided into levels, separated by the | symbol.
          - stackoverflow-posts: questions labelled relevant or irrelevant and categorized into first- and second-dimension categories.
          - quora-posts: questions labelled relevant or irrelevant and categorized into first- and second-dimension categories.
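
    The package's topics were produced with the MALLET tool, as listed above. Purely as an illustration of the LDA step for readers who prefer Python, here is a scikit-learn sketch; the "question_text" column name is an assumption, and the results will not match MALLET's output exactly.

    import pandas as pd
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = pd.read_csv("RQ1/LDA_input/combined-so-quora-mallet-metadata.csv")
    texts = docs["question_text"].fillna("")  # hypothetical column name

    vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=5)
    dtm = vectorizer.fit_transform(texts)

    lda = LatentDirichletAllocation(n_components=10, random_state=0)
    lda.fit(dtm)

    # Top words per topic, mirroring the role of topic-words.csv
    terms = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[-10:][::-1]]
        print(f"topic {i}: {', '.join(top)}")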

  9. A/B Trial Aggregated Data

    • kaggle.com
    Updated Sep 18, 2022
    Cite
    Sergei Logvinov (2022). A/B Trial Aggregated Data [Dataset]. https://www.kaggle.com/datasets/sergylog/ab-trial-aggregated-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 18, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sergei Logvinov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The CSV file contains aggregated data on the results of the experiment: user identifiers (user_id), treatment type (group), and key user metrics (views and clicks). The task is to analyze the results of the experiment and write up your recommendations.
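
    A minimal analysis sketch in Python follows; the column names come from the description above, while the file name and the choice of a two-proportion z-test are my own assumptions about how one might approach the task.

    import pandas as pd
    from statsmodels.stats.proportion import proportions_ztest

    df = pd.read_csv("ab_trial_aggregated.csv")  # hypothetical file name

    # Click-through rate per treatment group
    summary = df.groupby("group")[["views", "clicks"]].sum()
    summary["ctr"] = summary["clicks"] / summary["views"]
    print(summary)

    # Two-proportion z-test on clicks vs. views (assumes exactly two groups)
    stat, pvalue = proportions_ztest(summary["clicks"].values,
                                     summary["views"].values)
    print(f"z = {stat:.3f}, p = {pvalue:.4f}")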

  10. Data from: Notably Inaccessible – Data Driven Understanding of Data Science...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 26, 2023
    Cite
    Singanamalla, Sudheesh (2023). Notably Inaccessible – Data Driven Understanding of Data Science Notebook (In)Accessibility [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8185049
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    Potluri, Venkatesh
    Tieanklin, Nussara
    Mankoff, Jennifer
    Singanamalla, Sudheesh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset artifact contains the intermediate datasets from pipeline executions necessary to reproduce the results of the paper. We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate these datasets and analyse them are shared in the GitHub repository for this work.

    The dataset contains large files of approximately 60 GB, so please exercise caution when extracting the data from the compressed files.

    Some files in the dataset can take a significant amount of script run time to generate or reproduce.

    Dataset Contents

    We briefly summarize the included files in our dataset. Please refer to the documentation for specific information about the structure of the data in these files, the scripts to generate them, and runtimes for various parts of our data processing pipeline.

    epoch_9_loss_0.04706_testAcc_0.96867_X_resnext101_docSeg.pth: We share this model file, originally provided by Jobin et al., to enable the classification of figures found in our dataset. Please place this into the model/ directory.

    model-results.csv: This file contains results from the classification performed on the figures found in the notebooks in our dataset. Performing this classification may take up to a day.

    a11y-scan-dataset.zip: This archive contains two files and results in datasets of approximately 60GB when extracted. Please ensure that you have sufficient disk space to uncompress this zip archive. The archive contains:

    a11y/a11y-detailed-result.csv: This dataset contains the accessibility scan results from the scans run on the 100k notebooks across themes.

    The detailed result file can be really large (> 60 GB) and can be time-consuming to construct.

    a11y/a11y-aggregate-scan.csv: This file is an aggregate of the detailed result that contains the number of each type of error found in each notebook.

    This file is also shared outside the compressed directory.

    errors-different-counts-a11y-analyze-errors-summary.csv: This file contains the counts of errors that occur in notebooks across different themes.

    nb_processed_cell_html.csv: This file contains metadata corresponding to each cell extracted from the html exports of our notebooks.

    nb_first_interactive_cell.csv: This file contains the necessary metadata to compute the first interactive element, as defined in our paper, in each notebook.

    nb_processed.csv: This file contains the necessary data after processing the notebooks extracting the number of images, imports, languages, and cell level information.

    processed_function_calls.csv: This file contains the information about the notebooks, the various imports and function calls used within the notebooks.
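
    As a starting point for the kind of analysis these files support, the sketch below aggregates the accessibility scan results by theme. The column names are assumptions; consult the GitHub repository's documentation for the actual schema.

    import pandas as pd

    agg = pd.read_csv("a11y/a11y-aggregate-scan.csv")

    # Hypothetical columns: "theme", "error_type", "count"
    print(agg.groupby("theme")["count"].sum().sort_values(ascending=False))

    # Most common accessibility error types overall
    print(agg.groupby("error_type")["count"].sum().nlargest(10))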

  11. Assessing the impact of hints in learning formal specification: Research...

    • data.niaid.nih.gov
    Updated Jan 29, 2024
    Cite
    Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    Campos, José Creissac
    Sousa, Emanuel
    Margolis, Iara
    Macedo, Nuno
    Cunha, Alcino
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study investigating the impact of different types of automated hints while learning a formal specification language, not only on immediate performance and learning retention but also on the students' emotional response. This research artifact provides all the material required to replicate the study (except for the proprietary questionnaires used to assess the emotional response and user experience), as well as the collected data and the data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was provided in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the shape of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclose, and other, respectively), and average academic grade (GRADE, from 0 to 20; NA denotes a preference not to disclose).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.
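
    The provided pipeline is R-based, but the data files can also be inspected directly. A small Python sketch, using only the data_emo.csv layout documented above (ID, HINT, and the 14 per-session emotion differentials):

    import pandas as pd

    emo = pd.read_csv("data/results/data_emo.csv")

    positive = ["Admiration", "Desire", "Hope", "Fascination",
                "Joy", "Satisfaction", "Pride"]
    negative = ["Anger", "Boredom", "Contempt", "Disgust",
                "Fear", "Sadness", "Shame"]

    # Mean differential per treatment group (HINT is N, L, E or D);
    # pandas skips the NA values from unsubmitted questionnaires
    for session in ("1", "2"):
        cols = [e + session for e in positive + negative]
        print(f"Session {session} mean emotion differentials by hint group:")
        print(emo.groupby("HINT")[cols].mean())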

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with the tasks for each session, run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch; wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.

    Before the 1st session the participants should answer the socio-demographic questionnaire, which should ask for the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed in a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  12. csv file for jupyter notebook

    • figshare.com
    txt
    Updated Nov 21, 2022
    Cite
    Johanna Schultz (2022). csv file for jupyter notebook [Dataset]. http://doi.org/10.6084/m9.figshare.21590175.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Johanna Schultz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    df_force_kin_filtered.csv is the data sheet used by the DATA3 Python notebook to analyse kinematics and dynamics combined. It contains the footfalls that have data for both kinematics and dynamics. To see how this file is generated, read the first half of the Jupyter notebook.

  13. Dog Food Data Extracted from Chewy (USA) - 4,500 Records in CSV Format

    • crawlfeeds.com
    csv, zip
    Updated Apr 22, 2025
    Cite
    Crawl Feeds (2025). Dog Food Data Extracted from Chewy (USA) - 4,500 Records in CSV Format [Dataset]. https://crawlfeeds.com/datasets/dog-food-data-extracted-from-chewy-usa-4-500-records-in-csv-format
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Apr 22, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Dog Food Data Extracted from Chewy (USA) dataset contains 4,500 detailed records of dog food products sourced from one of the leading pet supply platforms in the United States, Chewy. This dataset is ideal for businesses, researchers, and data analysts who want to explore and analyze the dog food market, including product offerings, pricing strategies, brand diversity, and customer preferences within the USA.

    The dataset includes essential information such as product names, brands, prices, ingredient details, product descriptions, weight options, and availability. Organized in a CSV format for easy integration into analytics tools, this dataset provides valuable insights for those looking to study the pet food market, develop marketing strategies, or train machine learning models.

    Key Features:

    • Record Count: 4,500 dog food product records.
    • Data Fields: Product names, brands, prices, descriptions, ingredients, etc. Find more fields under the data points section.
    • Format: CSV, easy to import into databases and data analysis tools.
    • Source: Extracted from Chewy’s official USA platform.
    • Geography: Focused on the USA dog food market.

    Use Cases:

    • Market Research: Analyze trends and preferences in the USA dog food market, including popular brands, price ranges, and product availability.
    • E-commerce Analysis: Understand how Chewy presents and prices dog food products, helping businesses compare their own product offerings.
    • Competitor Analysis: Compare different brands and products to develop competitive strategies for dog food businesses.
    • Machine Learning Models: Use the dataset for machine learning tasks such as product recommendation systems, demand forecasting, and price optimization.
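
    As a quick illustration of the market-research use case above, a minimal pandas sketch follows; the file and column names ("brand", "price") are assumptions, so check the dataset's data points section for the exact fields.

    import pandas as pd

    dog_food = pd.read_csv("chewy_dog_food.csv")
    dog_food["price"] = pd.to_numeric(dog_food["price"], errors="coerce")

    # Average and median price per brand for a quick pricing view
    stats = dog_food.groupby("brand")["price"].agg(["count", "mean", "median"])
    print(stats.sort_values("count", ascending=False).head(15))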

  14. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and provide customers with itemset suggestions, allowing us to increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set, such as frequent patterns in a transaction database. It can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support/P(computer mouse) = 0.08/0.10 = 0.8 - lift = confidence/P(mouse mat) = 0.8/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
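
    Independent of the R walkthrough below, the arithmetic of this example is easy to verify in a few lines of Python:

    # 100 customers: 10 bought a computer mouse, 9 a mouse mat, 8 both
    n, mouse, mat, both = 100, 10, 9, 8

    support = both / n                  # P(mouse and mat) = 0.08
    confidence = support / (mouse / n)  # support / P(mouse) = 0.8
    lift = confidence / (mat / n)       # confidence / P(mat) ≈ 8.9

    print(support, confidence, round(lift, 1))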

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    (Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png)

    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    (Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png)

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    (Images: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png, https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png)

    Next, we clean the data frame by removing missing values.

    (Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png)

    To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...

  15. Crypto Price Monitoring Dataset for On-chain Derivatives Research

    • zenodo.org
    csv
    Updated Mar 19, 2023
    Cite
    Ivan Vakhmyanin; Yana Volkovich (2023). Crypto Price Monitoring Dataset for On-chain Derivatives Research [Dataset]. http://doi.org/10.5281/zenodo.7749133
    Explore at:
    Available download formats: csv
    Dataset updated
    Mar 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Vakhmyanin; Yana Volkovich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # Crypto Price Monitoring Repository

    This repository contains two CSV data files that were created to support the research titled "Price Arbitrage for DeFi Derivatives." This research is to be presented at the IEEE International Conference on Blockchain and Cryptocurrencies, taking place on 5th May 2023 in Dubai, UAE. The data files include monitoring prices for various crypto assets from several sources. The data files are structured with five columns, providing information about the symbol, unified symbol, time, price, and source of the price.

    ## Data Files

    There are two CSV data files in this repository (one for each date):

    1. `Pricemon_results_2023_01_13.csv`
    2. `Pricemon_results_2023_01_14.csv`

    ## Data Format

    Both data files have the same format and structure, with the following five columns:

    1. `symbol`: The trading symbol for the crypto asset (e.g., BTC, ETH).
    2. `unified_symbol`: A standardized symbol used across different platforms.
    3. `time`: Timestamp for when the price data was recorded (in UTC format).
    4. `price`: The price of the crypto asset at the given time (in USD).
    5. `source`: The name of the price source for the data.

    ## Price Sources

    The `source` column in the data files refers to the provider of the price data for each record. The sources include:

    - `chainlink`: Chainlink Price Oracle
    - `mycellium`: Built-in oracle of the Mycellium platform
    - `bitfinex`: Bitfinex cryptocurrency exchange
    - `ftx`: FTX cryptocurrency exchange
    - `binance`: Binance cryptocurrency exchange

    ## Usage

    You can use these data files for various purposes, such as analyzing price discrepancies across different sources, identifying trends, or developing trading algorithms. To use the data, simply import the CSV files into your preferred data processing or analysis tool.

    ### Example

    Here's an example of how you can read and display the data using Python and the pandas library:

    import pandas as pd

    # Read the data from the CSV file
    data = pd.read_csv('Pricemon_results_2023_01_13.csv')

    # Display the first 5 rows of the data
    print(data.head())
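
    Building on that, a natural next step for the arbitrage question behind this dataset is to compare prices for the same asset across sources. The follow-up sketch below uses only the five documented columns and reuses the `data` frame from the example above; note it assumes timestamps are aligned across sources (resample first if they are not).

    # One column per price source for each (time, asset) pair
    wide = data.pivot_table(index=["time", "unified_symbol"],
                            columns="source", values="price")

    # Cross-source spread: highest minus lowest price at each timestamp
    spread = wide.max(axis=1) - wide.min(axis=1)
    print(spread.groupby(level="unified_symbol").describe())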

    ## Acknowledgements

    These datasets were recorded and provided by Datamint (a value-added on-chain data provider) and its team.


    ## Contributing

    If you have any suggestions or find any issues with the data, please feel free to contact the authors.

  16. Fox News dataset is for analyzing media trends and narratives

    • crawlfeeds.com
    csv, zip
    Updated May 19, 2025
    Cite
    Crawl Feeds (2025). Fox News dataset is for analyzing media trends and narratives [Dataset]. https://crawlfeeds.com/datasets/fox-news-dataset
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.

    Key Features of the Fox News Dataset

    • Extensive Coverage: Contains more than 1 million articles spanning various topics and events up to 2023.
    • Research-Ready: Perfect for text classification, natural language processing (NLP), and other research purposes.
    • Format: Provided in CSV format for seamless integration into analytical and research tools.

    Why Use This Dataset?

    This large dataset is ideal for:

    • Text Classification: Develop machine learning models to classify and categorize news content.
    • Natural Language Processing (NLP): Conduct sentiment analysis, keyword extraction, or topic modeling (see the sketch after this list).
    • Media and Political Research: Analyze media narratives, public opinion, and political trends reflected in Fox News articles.
    • Trend Analysis: Identify shifts in public discourse and media focus over time.
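
    As an illustration, here is a minimal keyword-extraction sketch using pandas and scikit-learn. The file name `fox_news_dataset.csv` and the `text` column are assumptions, since the exact CSV schema is not listed here:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical file and column names; adjust to the actual schema
    articles = pd.read_csv('fox_news_dataset.csv')
    sample = articles['text'].dropna().sample(10_000, random_state=0)

    # Rank terms by average TF-IDF weight across the sample
    vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
    tfidf = vectorizer.fit_transform(sample)
    weights = tfidf.mean(axis=0).A1
    terms = vectorizer.get_feature_names_out()

    # Twenty most prominent terms, a rough view of the dominant vocabulary
    for weight, term in sorted(zip(weights, terms), reverse=True)[:20]:
        print(f'{term}: {weight:.4f}')
    ```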

    Explore More News Datasets

    Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.

    The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.

  17. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study"

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Jan 16, 2024
    Cite
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele Bavota; Massimiliano Di Penta (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. http://doi.org/10.5281/zenodo.10058142
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Federica Pepe; Vittoria Nardone; Antonio Mastropaolo; Gerardo Canfora; Gabriele Bavota; Massimiliano Di Penta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

    ## Root directory

    - `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

    - `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

    - `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the `script` directory.

    ## Dataset

    - `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

    - `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

    - `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

    - `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

    - `Dataset/Dataset_model-download_num-prj_correlation.csv`: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
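
    For instance, here is a minimal Python sketch of the usage-downloads correlation that `statistics.r` computes; the column names `num_projects` and `downloads` are assumptions, as the CSV header is not documented here:

    ```python
    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical column names for the reuse and download counts
    df = pd.read_csv('Dataset/Dataset_model-download_num-prj_correlation.csv')
    rho, p = spearmanr(df['num_projects'], df['downloads'])
    print(f'Spearman rho = {rho:.3f} (p = {p:.3g})')
    ```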

    ## RQ1

    - `RQ1/RQ1_dataset-list.txt`: list of HF datasets

    - `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

    - `RQ1/RQ1_analyzeDatasetTags.py`: Python script to analyze model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the `RQ1/RQ1_countDataset.py` script

    - `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model declares only HF datasets, (ii) the model declares only external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

    - `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`

    - `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`

    ## RQ2

    - `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model task

    - `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

    - `RQ2/RQ2_isBiased.csv`: file to compute the inter-rater agreement of whether or not a model documents Bias

    - `RQ2/RQ2_biasAgrLabels.csv`: file to compute the inter-rater agreement related to bias categories

    - `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    ## RQ3

    - `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

    - `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

    - `RQ3/RQ3_prjs_license.csv`: for each project linked to models, indicates (among other fields) the license tag and name

    - `RQ3/RQ3_models_license.csv`: for each model, indicates (among other information) whether the model has a license and, if so, what kind of license

    - `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

    - `RQ3/RQ3_models_prjs_licenses_with_type.csv`: pairs project-model, with their respective licenses and permissiveness level

    ## scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.

  18. Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends | Download Now!

    • crawlfeeds.com
    csv, zip
    Updated Jul 9, 2025
    Cite
    Crawl Feeds (2025). Comprehensive eBay Products Dataset: Analyze Listings, Prices, and Trends | Download Now! [Dataset]. https://crawlfeeds.com/datasets/ebay-products-dataset
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Massive eBay Marketplace Data Collection for E-commerce Intelligence

    Unlock the power of online marketplace analytics with our comprehensive eBay products dataset. This premium collection contains 1.29 million products from eBay's global marketplace, providing extensive insights into one of the world's largest e-commerce platforms. Perfect for competitive analysis, pricing strategies, market research, and machine learning applications in e-commerce.

    Dataset Overview

    • Total Products: 1,290,000+ marketplace listings
    • Source: eBay Global Marketplace
    • Format: CSV, ZIP compressed
    • File Size: Optimized compressed format
    • Coverage: Multi-category product listings across eBay

    Complete Data Fields Included

    Product Identification

    • id: Unique eBay product identifiers
    • name: Complete product titles and names
    • url: Direct eBay listing page links
    • epid: eBay Product ID for catalog matching
    • source: Data source identification

    Product Details

    • raw_product_description: Original unprocessed product descriptions
    • product_description: Cleaned and formatted product descriptions
    • brand: Brand names and manufacturer information
    • mpn: Manufacturer Part Numbers
    • gtin13: Global Trade Item Numbers (barcodes)

    Pricing and Availability

    • price: Current listing prices
    • currency: Currency information for international listings
    • in_stock: Stock availability status
    • breadcrumbs: Category navigation paths

    Visual and Technical Data

    • images: Product image URLs and references
    • crawled_at: Data collection timestamps
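
    As a starting point, here is a minimal sketch that loads the CSV and summarizes prices by top-level category. The file name `ebay_products.csv` is hypothetical, the fields follow the list above, and the '>' breadcrumb separator and boolean `in_stock` flag are assumptions:

    ```python
    import pandas as pd

    # Hypothetical file name; fields follow the schema documented above
    products = pd.read_csv('ebay_products.csv')

    # Derive a top-level category from the breadcrumb path
    # (assumes breadcrumbs are '>'-separated)
    products['top_category'] = (
        products['breadcrumbs'].str.split('>').str[0].str.strip()
    )

    # Median price, listing count, and stock rate per top-level category
    summary = products.groupby('top_category').agg(
        median_price=('price', 'median'),
        listings=('id', 'count'),
        in_stock_rate=('in_stock', 'mean'),  # assumes a boolean flag
    )
    print(summary.sort_values('listings', ascending=False).head(10))
    ```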

    Key Use Cases

    E-commerce Market Research

    • Analyze eBay marketplace trends and patterns
    • Study product category performance
    • Monitor pricing strategies across sellers
    • Identify high-demand product categories

    Competitive Intelligence

    • Benchmark pricing against eBay marketplace
    • Analyze product positioning strategies
    • Study seller competition and market share
    • Monitor inventory levels and availability

    Price Optimization

    • Develop dynamic pricing algorithms
    • Analyze price elasticity across categories
    • Compare marketplace pricing trends
    • Optimize listing prices for maximum visibility

    Machine Learning Applications

    • Train recommendation systems
    • Develop price prediction models
    • Create product categorization algorithms
    • Build inventory forecasting systems

    Target Industries

    E-commerce Retailers

    • Marketplace Strategy: Optimize eBay selling strategies
    • Pricing Intelligence: Competitive price monitoring
    • Product Research: Identify profitable product opportunities
    • Inventory Planning: Demand forecasting and stock optimization

    Technology Companies

    • AI Training Data: Machine learning model development
    • Analytics Platforms: E-commerce intelligence tools
    • Price Comparison: Marketplace comparison services
    • Search Enhancement: Product discovery optimization

    Market Research Firms

    • Industry Reports: E-commerce marketplace analysis
    • Consumer Behavior: Online shopping pattern studies
    • Brand Monitoring: Brand performance tracking
    • Trend Analysis: Market trend identification

    Academic Research

    • E-commerce Studies: Online marketplace research
    • Business Intelligence: Retail analytics case studies
    • Data Science Projects: Large-scale dataset analysis
    • Economic Research: Digital marketplace economics

    Data Quality Features

    • Comprehensive Coverage: 1.29 million unique products
    • Rich Metadata: Complete product information included
    • Validated Data: Quality-checked and processed
    • Structured Format: Ready-to-use CSV format
    • Global Scope: International marketplace coverage

  19. Library of simulated root images, with different noise levels

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated Jan 24, 2020
    Cite
    Guillaume Lobet; Iko Koevoets; Manuel Noll; Pierre Tocquin; Patrick E Meyer; Loic Pagès; Claire Périlleux (2020). Library of simulated root images, with different noise levels [Dataset]. http://doi.org/10.5281/zenodo.208214
    Explore at:
    zip, csv (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillaume Lobet; Iko Koevoets; Manuel Noll; Pierre Tocquin; Patrick E Meyer; Loic Pagès; Claire Périlleux
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the images and associated RSML files used in the paper entitled "Using a structural root system model to evaluate and improve the accuracy of root image analysis pipelines" by the same authors.

    It contains:

    - 30,000 simulated root images of 10,000 different root systems, with different noise levels (0 = none, 1 = medium, 3 = high);

    - 10,000 corresponding RSML files;

    - .csv files containing the ground-truth data for each modelled root system (500-data.csv);

    - .csv files containing the image descriptors extracted using RIA-J (500-descriptors.csv);

    The code used to generate and analyse this dataset is available here: http://doi.org/10.5281/zenodo.208499

    Note about the metrics computed from the RSML files (contained in 500-data.csv): the root systems were simulated in a constrained 2D space (rhizotron-like). Therefore, the RSML_reader plugin computed the lengths of the different roots in 2D only (despite the fact that a Z coordinate is present in the RSML files).

    CORRECTION: There is a scale issue in the 500-descriptors.csv file: the "area" and "convexhull" column values should be divided by 10.
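
    A minimal pandas sketch of applying this correction before analysis:

    ```python
    import pandas as pd

    # Load the image descriptors extracted with RIA-J
    descriptors = pd.read_csv('500-descriptors.csv')

    # Apply the documented scale correction to the affected columns
    descriptors[['area', 'convexhull']] = descriptors[['area', 'convexhull']] / 10
    ```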

  20. Comprehensive Grainger Products Dataset - 200K Records in CSV Format

    • crawlfeeds.com
    csv, zip
    Updated Jul 15, 2025
    Cite
    Crawl Feeds (2025). Comprehensive Grainger Products Dataset - 200K Records in CSV Format [Dataset]. https://crawlfeeds.com/datasets/comprehensive-grainger-products-dataset-200k-records-in-csv-format
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Jul 15, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock the power of data with our comprehensive Grainger Products Dataset, featuring over 220,000 meticulously curated records in CSV format. This dataset is an invaluable resource for businesses, researchers, and data scientists looking to optimize their operations, conduct market analysis, or enhance their machine learning models.

    Each record in the dataset includes critical fields such as URL, title, brand, SKU, price, pricing unit, product model, product ID, product UNSPSC, breadcrumbs, images, specifications, compliance and restrictions, description, unique ID, and the scraped date. Whether you're analyzing product trends, comparing prices, or developing e-commerce solutions, this dataset provides the depth and breadth of information you need.
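
    For example, here is a minimal sketch of loading the records and comparing prices by brand; the file name `grainger_products.csv` and the exact CSV header spellings are assumptions based on the field list above:

    ```python
    import pandas as pd

    # Hypothetical file name; columns follow the documented fields
    df = pd.read_csv('grainger_products.csv')

    # Median price across the ten most common brands
    top_brands = df['brand'].value_counts().head(10).index
    print(
        df[df['brand'].isin(top_brands)]
        .groupby('brand')['price']
        .median()
        .sort_values(ascending=False)
    )
    ```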

    Submit your custom requests on the Grainger products page.

    Example Use Cases:

    • Market Research: Analyze product offerings, prices, and brands to stay competitive.
    • Machine Learning: Train models on rich product data to improve recommendations or search algorithms.
    • E-commerce Solutions: Integrate with your platform to enhance product listings, optimize pricing strategies, or automate inventory management.

    Start leveraging this data to make informed decisions and gain a competitive edge in your industry.
