Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The solutions are evaluated on two criteria: predicted future Index values and energy allocated from a newly discovered star.

1. Index predictions are evaluated using the RMSE metric.
2. Energy allocation is also evaluated using the RMSE metric, and has a set of known constraints that need to be taken into account.
Every galaxy has a certain limited potential for improvement in the Index, described by the following function:

Potential for increase in the Index = -np.log(Index + 0.01) + 3

The likely increase in the Index depends on the potential for improvement and on the extra energy available, and is described by the following function:

Likely increase in the Index = extra energy * (Potential for increase in the Index)**2 / 1000
The allocation is subject to the following constraints (see the sketch after the variable table below):

- In total, there are 50000 zillion DSML available for allocation.
- No galaxy should be allocated more than 100 zillion DSML or less than 0 zillion DSML.
- Galaxies with a low existence expectancy index (below 0.7) should be allocated at least 10% of the total available energy.
| Variable | Description |
| --- | --- |
| Index | Unique index from the test dataset, in ascending order |
| pred | Prediction for the Index of interest |
| pred_opt | Optimal energy allocation |
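A minimal sketch of the two functions and the constraint checks, assuming a pandas DataFrame `df` with a `pred_opt` allocation column; the `existence_expectancy_index` column name is an assumption inferred from the description:

```python
import numpy as np
import pandas as pd

def potential_for_increase(index):
    """Limited potential for improvement in the Index."""
    return -np.log(index + 0.01) + 3

def likely_increase(extra_energy, index):
    """Likely increase in the Index for a given extra energy allocation."""
    return extra_energy * potential_for_increase(index) ** 2 / 1000

def check_constraints(df, total=50000):
    """Validate an allocation stored in df['pred_opt'] against the stated rules."""
    alloc = df["pred_opt"]
    assert alloc.sum() <= total, "more than 50000 zillion DSML allocated"
    assert alloc.between(0, 100).all(), "per-galaxy allocation outside [0, 100]"
    low_ee = df["existence_expectancy_index"] < 0.7  # hypothetical column name
    assert alloc[low_ee].sum() >= 0.10 * total, \
        "low existence expectancy galaxies received less than 10% of the total"
```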
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Hackathons are a great way for people not only to learn more about technology but also to showcase their existing skills by building projects, often in just a few hours. This dataset contains data collected from 200 participants of a hackathon conducted for high school students. Many columns have been deleted, but the remaining columns can be useful for understanding the demographics and interests of someone participating in these kinds of events.
Given below are three files that you will be using for the challenge. Download all the files. The training file has a labelled data set; the test file has only the features. Train your algorithm on the training set, make predictions on the test file, and then create a submissions.csv file that will be evaluated. You may refer to the sampleSubmission.csv file to understand the overall structure of your submission. The dataset consists of overall stats of players in ODIs only.
File descriptions:
train.csv - the training set
test.csv - the test set
sampleSubmission.csv - a sample submission file in the correct format

Data fields:
id - an anonymous id unique to the player
Name - name of the player
Age - age of the player
100s - number of centuries scored by the player
50s - number of half centuries scored by the player
6s - total number of sixes hit by the player
Balls - number of balls bowled by the player
Bat_Average - average batting score
Bowl_Strike_Rate - average number of balls bowled per wicket taken
Balls faced - number of balls faced
Economy - average number of runs conceded per over bowled
Innings - number of innings played
Overs - number of overs bowled
Maidens - overs in which no run was conceded
Runs - total runs scored by the player
Wickets - number of wickets taken
Ratings - final rating of the player
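A minimal end-to-end sketch of the expected workflow (the gradient-boosting baseline is an illustrative choice, not part of the challenge; column handling assumes the data fields listed above):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Use the numeric player stats as features; Ratings is the target,
# id and Name are identifiers.
X = train.drop(columns=["id", "Name", "Ratings"]).select_dtypes("number").fillna(0)
y = train["Ratings"]

model = GradientBoostingRegressor().fit(X, y)
preds = model.predict(test[X.columns].fillna(0))

# Follow the structure shown by sampleSubmission.csv for the output file.
pd.DataFrame({"id": test["id"], "Ratings": preds}).to_csv("submissions.csv", index=False)
```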
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the materials used in the session "Care to Share? Investigating Open Science practices adoption among researchers: a hackathon" presented at the Dutch National Open Science Festival on 22nd October 2024.
The data files are derived from: Public Library of Science (2022) PLOS Open Science Indicators. Figshare. Dataset (version 8). https://doi.org/10.6084/m9.figshare.21687686, and contain two additional fields (Dimensions_Country and Dimensions_FoR) obtained on 15 October 2024 from Digital Science's Dimensions platform, available at https://app.dimensions.ai.
PLOS-Dataset-for-Hackathon.xlsx
Data pertaining to the PLOS corpus of articles derived from Public Library of Science (2022) PLOS Open Science Indicators. Figshare. Dataset (version 8). https://doi.org/10.6084/m9.figshare.21687686 with additional data from Dimensions.ai.
Comparator-Dataset-for-Hackathon.xlsx
Data pertaining to the Comparator corpus of articles derived from Public Library of Science (2022) PLOS Open Science Indicators. Figshare. Dataset (version 8). https://doi.org/10.6084/m9.figshare.21687686 with additional data from Dimensions.ai.
Care to share resource sheet.pdf
Document outlining the questions to be investigated during the hackathon as well as key information about the dataset.
OSI-Column-Descriptions_v3_Dec23.pdf
This file is taken from Public Library of Science (2022) PLOS Open Science Indicators. Figshare. Dataset (version 8). https://doi.org/10.6084/m9.figshare.21687686. It describes the fields used in the two data files with the exception of Dimensions_Country and Dimensions_FoR. Descriptions for these are listed in the README tabs of the data files.
Courtesy of the European Space Agency.
License: ESA CC BY-SA 3.0 IGO
etalab-2.0: https://spdx.org/licenses/etalab-2.0.html
This dataset was collected from four event-based surveillance (EBS) systems to be used in a hackathon dedicated to AMR (antimicrobial resistance) for the MOOD summer school in June 2022. The chosen EBS sources are ProMED, PADI-web, HealthMap, and MedISys. The collected data are news items dealing with epidemiological information or events. The dataset is composed of 4 sub-datasets, one for each chosen EBS source. Each sub-dataset is annotated according to 3 main classes (New Information, General Information, Not Relevant). For each news item labeled as New Information or General Information, another annotation is provided for host classification, with 7 classes (Humans, Human-animal, Animals, Human-food, Food, Environment, and All). This second annotation yields 4 further sub-datasets. The aim of the annotation task is to recognize epidemiological information related to AMR. An annotation guideline is provided in order to ensure a unified annotation and to help the annotators. This dataset can be used to train or evaluate classification approaches that automatically identify text on AMR events and types of AMR issues (e.g. animal, food) in unstructured data (e.g. news, tweets) and classify these events by relevance for epidemic intelligence purposes.
Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The hospital length of stay dataset is part of a hackathon organized by Analytics Vidhya focusing on healthcare management challenges, particularly optimizing hospital patient length of stay. The dataset includes detailed information on patient demographics, hospital attributes, and treatment details, all of which are critical for managing healthcare efficiency.

2) Data Utilization
(1) Characteristics of the hospital length of stay data:
• The dataset is structured to provide insight into the various factors that affect the length of hospital stays. It covers numerous variables, including patient age, medical conditions, previous admissions, and the type of hospital and care involved.
• It supports predictive modeling, helping hospitals improve service delivery by accurately forecasting patient stay durations and managing hospital bed occupancy and staffing needs more effectively.
(2) Uses of the hospital length of stay data:
• Hospital Management: the data can assist in strategic planning and resource allocation, helping hospitals reduce costs while maintaining high care standards.
• Research in Healthcare Systems: it serves as a foundational dataset for academic and commercial research aimed at understanding and improving healthcare system efficiency.
AV HackLive - Guided Community Hackathon!
Data Science competitions can be daunting for someone who has never participated in one. Some of them have hundreds of competitors with top-notch industry knowledge and a splendid past record in such hackathons.

Thus, a lot of beginners are apprehensive about getting started with these hackathons.
The top 3 questions that are commonly asked:
1. Is it even worth it if I have minimal chance of winning?
2. How do I start?
3. How can I improve my rank in the future?

Let's answer the first question before we go further.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract:
This dataset of breast cancer patients was obtained from the November 2017 update of the SEER Program of the NCI, which provides information on population-based cancer statistics. The dataset involved female patients with infiltrating duct and lobular carcinoma breast cancer (SEER primary sites, recode NOS histology codes 8522/3) diagnosed in 2006-2010. Patients with unknown tumor size, unknown examined regional lymph nodes (LNs), or unknown regional positive LNs, and patients whose survival time was less than 1 month, were excluded; thus, 4024 patients were ultimately included.
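A minimal sketch of the exclusion step described above (the file and column names below are hypothetical placeholders; the actual SEER field names differ):

```python
import pandas as pd

df = pd.read_csv("seer_breast_cancer.csv")  # hypothetical file name

# Exclude records with unknown tumor size or nodal counts,
# and patients who survived less than 1 month.
cohort = df[
    df["tumor_size"].notna()                 # hypothetical column names
    & df["regional_nodes_examined"].notna()
    & df["regional_nodes_positive"].notna()
    & (df["survival_months"] >= 1)
]
print(len(cohort))  # the description reports 4024 patients after exclusion
```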
Inspiration:
This dataset uploaded to U-BRITE for "AI against CANCER DATA SCIENCE HACKATHON"
https://cancer.ubrite.org/hackathon-2021/
Acknowledgements
JING TENG, January 18, 2019, "SEER Breast Cancer Data", IEEE Dataport, doi: https://dx.doi.org/10.21227/a9qy-ph35.
https://ieee-dataport.org/open-access/seer-breast-cancer-data
U-BRITE last update date: 07/21/2021
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Electricity Consumption’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/utathya/electricity-consumption on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The company Electrolysia supplies electricity to the city. It is looking to optimise its electricity production based on the historical electricity consumption of the people of Electrovania.

The company has hired you as a Data Scientist to investigate past consumption and weather information and to come up with a model that captures the trend as accurately as possible. Bear in mind that many factors affect electricity consumption and not all of them can be measured. Electrolysia has provided you with hourly data spanning five years.

For this competition, the training set comprises the first 23 days of each month, and the test set covers the 24th to the end of the month. The public leaderboard is based on the first two days of the test period, whereas the private leaderboard considers the remaining days. Your task is to predict electricity consumption on an hourly basis.

Note that you cannot use future information to model past consumption. For example, you cannot use February 2017 data to predict the last week of January 2017.
It represents a fictitious time period wherein we are to predict future electricity consumption.
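A minimal sketch of the split described above (the file name and the `datetime` and `electricity_consumption` column names are assumptions):

```python
import pandas as pd

# Hypothetical file and column names.
df = pd.read_csv("electricity_consumption.csv", parse_dates=["datetime"])
df = df.sort_values("datetime")

# Training rows: days 1-23 of each month; test rows: day 24 onward.
train = df[df["datetime"].dt.day <= 23]
test = df[df["datetime"].dt.day >= 24]

# To respect the no-future-information rule, features built for a given
# timestamp (lags, rolling means, ...) must use only earlier observations.
df["consumption_lag_24h"] = df["electricity_consumption"].shift(24)
```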
This data is from an Analytics Vidhya hackathon. The hackathon is now closed.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
This is a small dataset that is part of a larger dataset of breast cancer images. The images are mammograms.
Instructions:
One can use these images for experimentation on detection and analysis of breast cancer.
Inspiration:
This dataset uploaded to U-BRITE for "AI against CANCER DATA SCIENCE HACKATHON"
https://cancer.ubrite.org/hackathon-2021/
Acknowledgements
G R Sinha, Bhagwati Charan Patel, December 27, 2019, "Mammograms-Breast Cancer Images", IEEE Dataport, doi: https://dx.doi.org/10.21227/9f0p-qx37.
https://ieee-dataport.org/documents/mammograms-breast-cancer-images
U-BRITE last update date: 07/21/2021
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘uHack Sentiments 2.0: Decode Code Words’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/manishtripathi86/uhack-sentiments-20-decode-code-words on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The challenge here is to analyze and dive deep into natural language text (reviews) and bucket the reviews based on their topics of discussion. Furthermore, analyzing the overall sentiment will also help the business make tangible decisions.
The data set provided to you has a mix of customer reviews for products across categories and retailers. We would like you to build a model on the data that, for future reviews, predicts:

- Topics (Components, Delivery and Customer Support, Design and Aesthetics, Dimensions, Features, Functionality, Installation, Material, Price, Quality, and Usability); note that a review can talk about multiple topics.
- Overall polarity (positive/negative sentiment).

Note: The target variables are all encoded in the train dataset for convenience. Please submit the test results in the same encoded fashion so we can evaluate your results.
| Field Name | Data Type | Purpose | Variable Type |
| --- | --- | --- | --- |
| Id | Integer | Unique identifier for each review | Input |
| Review | String | Review written by customers on a retail website | Input |
| Components | String | 1: aspects related to components; 0: none | Target |
| Delivery and Customer Support | String | 1: aspects related to delivery, return, exchange, and customer support; 0: none | Target |
| Design and Aesthetics | String | 1: aspects related to design and aesthetics; 0: none | Target |
| Dimensions | String | 1: related to product dimension and size; 0: none | Target |
| Features | String | 1: related to product features; 0: none | Target |
| Functionality | String | 1: related to the working of a product; 0: none | Target |
| Installation | String | 1: related to installation of the product; 0: none | Target |
| Material | String | 1: related to the material of the product; 0: none | Target |
| Price | String | 1: related to pricing details of a product; 0: none | Target |
| Quality | String | 1: related to quality aspects of a product; 0: none | Target |
| Usability | String | 1: related to usability of a product; 0: none | Target |
| Polarity | Integer | 1: positive sentiment; 0: negative sentiment | Target |
Skills:
- Text pre-processing: lemmatization, tokenization, n-grams, and other relevant methods
- Multi-class classification, multi-label classification
- Optimizing log loss
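A minimal multi-label baseline sketch, assuming train.csv/test.csv files with the fields listed in the table above (TF-IDF plus one-vs-rest logistic regression is an illustrative choice, not the prescribed approach):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

TARGETS = ["Components", "Delivery and Customer Support", "Design and Aesthetics",
           "Dimensions", "Features", "Functionality", "Installation", "Material",
           "Price", "Quality", "Usability", "Polarity"]

train = pd.read_csv("train.csv")  # assumed file names
test = pd.read_csv("test.csv")

# One-vs-rest trains an independent binary classifier per label, matching the
# multi-label setup: a review can belong to several topics at once.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train["Review"], train[TARGETS].astype(int))

# Per-label probabilities, suitable for a log-loss style evaluation.
probs = model.predict_proba(test["Review"])
```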
Overview

Ugam, a Merkle company, is a leading analytics and technology services company. Our customer-centric approach delivers impactful business results for large corporations by leveraging data, technology, and expertise.
We consistently deliver superior, impactful results through the right blend of human intelligence and AI. With 3300+ people spread across locations worldwide, we successfully deploy our services to create success stories across industries like Retail & Consumer Brands, High Tech, BFSI, Distribution, and Market Research & Consulting. Over the past 21 years, Ugam has been recognized by several firms including Forrester and Gartner, named the No.1 data science company in India by Analytics Insight, and certified as a Great Place to Work®.
Problem Statement: The last two decades have witnessed a significant change in how consumers purchase products and express their experience/opinions in reviews, posts, and content across platforms. These online reviews are not only useful to reflect customers’ sentiment towards a product but also help businesses fix gaps and find potential opportunities which could further influence future purchases.
Participants need to develop a machine learning model that can analyse customers' sentiments based on their reviews and feedback.
NOTE: The prize money is reserved for candidates who are willing to be interviewed or hired by Ugam. Winners are requested to come to the Machine Learning Developers Summit 2022, happening in Bangalore, to receive the prize money.
dataset link: https://machinehack.com/hackathon/uhack_sentiments_20_decode_code_words/overview
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ORD for the Sciences Hackathon - Vehicles Detection
[!CAUTION] This project is an example of a hackathon project. The quality of the data produced has not been evaluated. Its goal is to provide an example of how a dataset can be uploaded to Hugging Face.

This is an example of a hackathon project presented at the ORD for the Sciences hackathon, using the openly available pNeuma Vision dataset.

If you want to know more about the hackathon, see the EPFL pNeuma project… See the full description on the dataset page: https://huggingface.co/datasets/katospiegel/ordfts-hackathon-pneuma-vehicles-segmentation.
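The dataset can be pulled straight from the Hugging Face Hub with the `datasets` library; a minimal sketch:

```python
from datasets import load_dataset

# Load the hackathon dataset directly from the Hugging Face Hub.
ds = load_dataset("katospiegel/ordfts-hackathon-pneuma-vehicles-segmentation")
print(ds)  # shows the available splits and their features
```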
Participants needed to develop a solution that improves the effectiveness of SMS targeting, so that messages are sent only to customers who are motivated to make a purchase.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
This dataset contains information on 83 patients from India: the patients' clinical history, histopathological features, and mammograms. The distinctive aspect of this dataset lies in its collection of mammograms showing benign tumors, which can be used for the subclassification of benign tumors.
Instructions:
This dataset contains a zip folder of 80 mammograms and an Excel file holding the mammographic features, histopathological features, and clinical features of all the patients.
Inspiration:
This dataset uploaded to U-BRITE for "AI against CANCER DATA SCIENCE HACKATHON"
https://cancer.ubrite.org/hackathon-2021/
Acknowledgements
Manish Joshi, Aparna Bhale, Unmesh Takalkar, May 9, 2021, "Benign Breast Tumor Dataset", IEEE Dataport, doi: https://dx.doi.org/10.21227/6sda-hn78.
https://ieee-dataport.org/open-access/benign-breast-tumor-dataset
U-BRITE last update date: 07/09/2021
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
Microwave-based breast cancer detection is a growing field that has been investigated as a potential novel method for breast cancer detection. Breast microwave sensing (BMS) systems use low-powered, non-ionizing microwave signals to interrogate the breast tissues. While some BMS systems have been evaluated in clinical trials, many challenges remain before these systems can be used as a viable clinical option, and breast phantoms (breast models) allow for rigorous and controlled experimental investigations. This dataset, the University of Manitoba Breast Microwave Imaging Dataset (UM-BMID), contains S-parameter measurements from experimental scans of MRI-derived breast phantoms, obtained with a pre-clinical breast microwave sensing system operating over 1-8 GHz. The dataset consists of measurements from over 1250 scans of a diverse array of phantoms. The phantom array consists of phantoms of various sizes and breast densities. The .stl files used to produce the 3D-printed phantoms are also included in the dataset. We hope that this dataset can serve as a resource for researchers in breast microwave sensing to evaluate signal processing, image reconstruction, and tumour detection methods.
Inspiration:
This dataset uploaded to U-BRITE for "AI against CANCER DATA SCIENCE HACKATHON"
https://cancer.ubrite.org/hackathon-2021/
Acknowledgements
Tyson Reimer, Jordan Krenkevich, Stephen Pistorius, June 16, 2021, "University of Manitoba Breast Microwave Imaging Dataset (UM-BMID)", IEEE Dataport, doi: https://dx.doi.org/10.21227/1y0z-8t98.
https://ieee-dataport.org/open-access/university-manitoba-breast-microwave-imaging-dataset-um-bmid
U-BRITE last update date: 07/21/2021
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
GDSC 2024
This dataset contains the results from the Capgemini Global Data Science Challenge (GDSC) 2024 Arena Battles, where AI education policy experts competed to provide the best answers to questions about global education trends and literacy.

Quick Links:
- Case study
- GDSC Overview
- GDSC 7 Overview Video (Short)
- GDSC 7 Overview Video (Long)
- GDSC Website
Background
The Capgemini Global Data Science Challenge (GDSC) is an annual, purpose-driven hackathon that… See the full description on the dataset page: https://huggingface.co/datasets/Endercold/GDSC-2024.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from the germanwatch.org site.
File descriptions
id = video id
video_duration = duration of the video
coding_standard = coding standard used for the video
width = width of the video in pixels
height = height of the video in pixels
bitrate = video bitrate
framerate = actual video frame rate
i_frames = number of I-frames in the video
p_frames = number of P-frames in the video
b_frames = number of B-frames in the video
frames = number of frames in the video
i_size = total size in bytes of the I-frames
p_size = total size in bytes of the P-frames
b_size = total size in bytes of the B-frames
size = total size of the video
coding_standard_output = output coding standard used for processing
bitrate_output = output bitrate used for processing
framerate_output = output framerate used for processing
output_width = output width in pixels used for processing
output_height = output height in pixels used for processing
allocated_memory = total memory allocated to the coding standard for processing
total_processing_time = total time taken for processing
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A collection of useful datasets extracted from https://packages.ecosyste.ms and https://repos.ecosyste.ms for use at the CZI Hackathon: Mapping the Impact of Research Software in Science.
All data is provided as NDJSON (newline-delimited JSON): each line is a valid JSON object, and objects are separated by newline characters. There are Python and R libraries for reading these files, or you can manually read each line and parse it as a single JSON object.

Each ndjson file has been compressed with gzip (actual command: `tar -czvf`) to reduce download size; the files expand to significantly larger sizes after extraction.
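A minimal sketch of reading one of the files by hand in Python, after extracting the archive (`github.ndjson` is one of the files described below):

```python
import json

records = []
with open("github.ndjson", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip any blank lines defensively
            records.append(json.loads(line))  # each line is one JSON object

print(len(records), "records loaded")
```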
Package names from cran, bioconductor, and pypi that have been parsed by the software-mentions project (data: https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c) are collected together with their latest release at the time of publishing, along with the names of their dependencies. Those dependency names have then also been recursively fetched, with latest release and dependencies, until the full list of transitive dependencies is included.

Note: This approach uses a simplified method of dependency resolution, always picking the latest version of each package rather than taking into account each dependency's specific version range requirements. This is primarily due to time constraints, and it allows all software ecosystems to be processed in the same way. A future improvement would be to use each package ecosystem's specific dependency resolution algorithm to compute the full transitive dependency tree for each mentioned software package.
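A minimal sketch of the simplified, latest-version-only resolution described above (the `latest_release` helper is a hypothetical stand-in for a lookup against the packages.ecosyste.ms API):

```python
from collections import deque

def latest_release(name):
    """Hypothetical lookup: return (version, [dependency names])
    for a package's latest release, e.g. via the packages.ecosyste.ms API."""
    raise NotImplementedError

def resolve_transitive(roots):
    """Breadth-first walk that always picks each package's latest release,
    ignoring version range requirements, as the note above describes."""
    resolved, queue = {}, deque(roots)
    while queue:
        name = queue.popleft()
        if name in resolved:
            continue  # already resolved; the latest version is reused everywhere
        version, deps = latest_release(name)
        resolved[name] = version
        queue.extend(deps)
    return resolved
```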
Two different approaches were taken for collecting data for referenced GitHub mentions:
1. `github.ndjson` contains metadata for each repository from GitHub, including "manifest" files: known files that hold dependency information for a project, such as requirements.txt, DESCRIPTION, and package.json, parsed using https://github.com/ecosyste-ms/bibliothecary. This may include transitive dependencies that were discovered in a `lockfile` within the repository.

2. `github_packages.ndjson` contains metadata for each package, found on any package manager, that references the GitHub URL as its repository URL, source, or homepage. These packages, like the cran and pypi data above, include the latest release and its direct dependencies. There may be more than one package for each GitHub URL, as it is a one-to-many relationship. `github_packages_with_transitive.ndjson` follows the same format but also includes the extra resolved transitive dependencies of all packages, using the same approach as the cran and pypi data above, with the same caveats.

There are also many more ecosystems referenced in these files than just cran, bioconductor, and pypi; https://packages.ecosyste.ms provides a standardized metadata format for all of them to enable comparison and to simplify automation.
If you would like any help, support or more data from Ecosyste.ms please do get in touch via email: hello@ecosyste.ms or open an issue on GitHub: https://github.com/ecosyste-ms/packages/issues