2 datasets found
  1. Data Scientists vs Size of Datasets
     Hardware & Brain battle between data set size and data scientists

    • kaggle.com
    Updated Oct 18, 2016
    Cite
    Laurae (2016). Data Scientists vs Size of Datasets [Dataset]. https://www.kaggle.com/datasets/laurae2/data-scientists-vs-size-of-datasets/suggestions?status=pending
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Laurae
    Description

    This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. One hundred data scientists from France were interviewed between January and August 2016 to obtain usable data. This sample might therefore not be representative of the wider population.

    What can you do with the data?

    • Check whether Kagglers have "stronger" hardware than non-Kagglers
    • Check whether preferred data set size correlates with hardware
    • Is proficiency a predictor of specific preferences?
    • Do data scientists favor Intel or AMD?
    • How widespread is GPU computing, and is it related to Kaggling?
    • Can you predict how many euros a data scientist might invest, given their current workstation details?

    I did not find any past research on a similar scale. You are free to play with this data set. To reuse it outside Kaggle, please contact the author directly on Kaggle (use "Contact User") and mention:

    • Your intended usage (research? business use? blogging?...)
    • Your first/last name

    The characteristics used to describe data scientists and data set sizes were chosen arbitrarily.

    Data set size:

    • Small: under 1 million values
    • Medium: between 1 million and 1 billion values
    • Large: over 1 billion values
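
    A minimal sketch of these thresholds in code (the function name and the pandas usage are my own illustration, not part of the dataset):

    ```python
    import pandas as pd

    def dataset_size_category(df: pd.DataFrame) -> str:
        """Classify a data set as Small/Medium/Large by its total number of values."""
        n_values = df.shape[0] * df.shape[1]  # rows x columns = total values
        if n_values < 1_000_000:
            return "Small"
        if n_values <= 1_000_000_000:
            return "Medium"
        return "Large"

    # A 1,000 x 50 frame holds 50,000 values => "Small"
    print(dataset_size_category(pd.DataFrame(index=range(1000), columns=range(50))))
    ```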

    The data uses the following fields (DS = Data Scientist, W = Workstation):

    • DS_1 = Are you working with "large" data sets at work? (large = over 1 billion values) => Yes or No
    • DS_2 = Do you enjoy working with large data sets? => Yes or No
    • DS_3 = Would you rather have small, medium, or large data sets for work? => Small, Medium, or Large
    • DS_4 = Do you have any presence at Kaggle or any other Data Science platforms? => Yes or No
    • DS_5 = Do you consider yourself proficient at Data Science? => Yes, A bit, or No
    • W_1 = What is your CPU brand? => Intel or AMD
    • W_2 = Do you have access to a remote server to perform large workloads? => Yes or No
    • W_3 = How many euros would you invest in brand-new Data Science hardware? => numeric output, rounded to the nearest 100
    • W_4 = How many cores do you have to work with data sets? => numeric output
    • W_5 = How much RAM (in GB) do you have to work with data sets? => numeric output
    • W_6 = Do you do GPU computing? => Yes or No
    • W_7 = What programming languages do you use for Data Science? => R or Python (any other answer accepted)
    • W_8 = What programming languages do you use for pure statistical analysis? => R or Python (any other answer accepted)
    • W_9 = What programming languages do you use for training models? => R or Python (any other answer accepted)

    You should expect some noise in the data set. As with any survey, it might not be free of internal contradictions.
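
    A minimal sketch of loading the survey and probing one of the questions above; the file name is hypothetical, and the column names are assumed to match the field codes listed here:

    ```python
    import pandas as pd

    # Hypothetical file name; adjust to the CSV actually shipped with the dataset.
    df = pd.read_csv("data-scientists-vs-size-of-datasets.csv")

    # Do Kagglers (DS_4 == "Yes") work with more RAM (W_5) than non-Kagglers?
    print(df.groupby("DS_4")["W_5"].describe())

    # Rough Intel-vs-AMD split (W_1).
    print(df["W_1"].value_counts(normalize=True))
    ```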

  2. Data from: Bike Sharing Dataset

    • kaggle.com
    Updated Sep 10, 2024
    Cite
    Ram Vishnu R (2024). Bike Sharing Dataset [Dataset]. https://www.kaggle.com/datasets/ramvishnur/bike-sharing-dataset
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ram Vishnu R
    Description

    Problem Statement:

    A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems let people borrow a bike from a computer-controlled "dock": the user enters payment information, and the system unlocks a bike, which can then be returned to another dock belonging to the same system.

    The US bike-sharing provider BoomBikes has recently suffered a considerable dip in revenue due to the COVID-19 pandemic. The company is finding it difficult to sustain itself in the current market, so it has decided to develop a mindful business plan to accelerate its revenue.

    To that end, BoomBikes wants to understand the demand for shared bikes among the public, so that it can cater to people's needs once the situation improves, stand out from other service providers, and increase its profits.

    They have contracted a consulting company to understand the factors on which demand for these shared bikes depends, specifically in the American market. The company wants to know:

    • Which variables are significant in predicting demand for shared bikes
    • How well those variables describe bike demand

    Based on meteorological surveys and people's lifestyles, the firm has gathered a large dataset of daily bike demand across the American market.

    Business Goal:

    You are required to model the demand for shared bikes using the available independent variables. Management will use the model to understand how demand varies with different features, adjust business strategy to meet demand levels and customer expectations, and gauge the demand dynamics of new markets.

    Data Preparation:

    1. You can observe in the dataset that some variables, like 'weathersit' and 'season', take the values 1, 2, 3, 4, each with a specific label (as can be seen in the data dictionary). The numeric codes may suggest an ordering, which is not actually the case (check the data dictionary and consider why). It is therefore advisable to convert such features into categorical string values before model building; a conversion sketch follows this list. Please refer to the data dictionary for a better understanding of all the independent variables.
    2. You might notice that the column 'yr' has two values, 0 and 1, indicating the years 2018 and 2019 respectively. Your first instinct might be to drop this column, since only two values might not seem to add much to the model. In reality, because bike-sharing systems are steadily gaining popularity, demand rises every year, which makes 'yr' a potentially good predictor. So think twice before dropping it.
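
    Here is a sketch of that conversion with pandas. The file name is hypothetical, and the label mappings below follow the commonly used bike-sharing data dictionary; verify them against the dictionary shipped with this dataset:

    ```python
    import pandas as pd

    df = pd.read_csv("day.csv")  # hypothetical file name

    # Map numeric codes to label strings so models don't treat them as ordered.
    # Assumed mappings; check against the data dictionary.
    season_labels = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
    weather_labels = {1: "clear", 2: "mist", 3: "light_precip", 4: "heavy_precip"}

    df["season"] = df["season"].map(season_labels).astype("category")
    df["weathersit"] = df["weathersit"].map(weather_labels).astype("category")

    # One-hot encode for a linear model; drop_first avoids redundant dummies.
    df = pd.get_dummies(df, columns=["season", "weathersit"], drop_first=True)
    ```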

    Model Building:

    In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who made a rental, while 'registered' shows the number of registered users who made a booking on a given day. Finally, 'cnt' indicates the total number of bike rentals, including both casual and registered users. The model should be built with 'cnt' as the target variable; a minimal sketch follows.
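
    A minimal modelling sketch under that setup. Note that 'casual' and 'registered' sum to 'cnt', so keeping them as features would leak the target; the frame is assumed to be prepared as in the Data Preparation step (numeric and dummy columns only):

    ```python
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # 'casual' + 'registered' == 'cnt', so keeping them would leak the target.
    # Assumes df holds only numeric/dummy columns (see Data Preparation above).
    X = df.drop(columns=["cnt", "casual", "registered"])
    y = df["cnt"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    ```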

    Model Evaluation:

    When you are done with model building and residual analysis and have made predictions on the test set, make sure you use the following two lines of code to calculate the R-squared score on the test set:

    ```python
    from sklearn.metrics import r2_score
    r2_score(y_test, y_pred)
    ```

    Here y_test is the test data for the target variable, and y_pred contains the predicted values of the target variable on the test set. Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.

