CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/
Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:
Over 8 million 311 service requests from 2012-2016
More than 1 million motor vehicle collisions 2012-present
Citi Bike stations and 30 million Citi Bike trips 2013-present
Over 1 billion Yellow and Green Taxi rides from 2009-present
Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015
This dataset is deprecated and not being updated.
Fork this kernel to get started with this dataset.
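As a starting point, the 311 requests can be queried directly from BigQuery with the Python client. This is a minimal sketch, assuming the google-cloud-bigquery package and GCP credentials are set up; the exact table name under bigquery-public-data is an assumption and may differ.
from google.cloud import bigquery

# Count the most common 311 complaint types (table path is an assumption).
client = bigquery.Client()
query = """
    SELECT complaint_type, COUNT(*) AS num_requests
    FROM `bigquery-public-data.new_york.311_service_requests`
    GROUP BY complaint_type
    ORDER BY num_requests DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.complaint_type, row.num_requests)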
https://opendata.cityofnewyork.us/
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.
The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.
Banner photo by @bicadmedia from Unsplash.
On which New York City streets are you most likely to find a loud party?
Can you find the Virginia Pines in New York City?
Where was the only collision caused by an animal that injured a cyclist?
What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?
https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions CSV file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
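For programmatic access, the mapping from a KernelVersions id to its folder can be computed directly. This is a minimal sketch of the layout described above; the file extension is an assumption (notebooks may be .ipynb, .py, or .R).
# Map a KernelVersions id to its expected folder, following the layout above.
def kernel_version_path(version_id: int, extension: str = "ipynb") -> str:
    top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
    sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
    return f"{top}/{sub}/{version_id}.{extension}"

print(kernel_version_path(123456789))  # -> 123/456/123456789.ipynb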
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
Column | Description |
code_blocks_index | Global index linking code blocks to markup_data.csv. |
kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
code_block_id | Position of the code block within the notebook. |
code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
Column | Description |
kernel_id | Identifier for the Kaggle Jupyter notebook. |
kaggle_score | Performance metric of the notebook. |
kaggle_comments | Number of comments on the notebook. |
kaggle_upvotes | Number of upvotes the notebook received. |
kernel_link | URL to the notebook. |
comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
Column | Description |
comp_name | Name of the Kaggle competition. |
description | Overview of the competition task. |
data_type | Type of data used in the competition. |
comp_type | Classification of the competition. |
subtitle | Short description of the task. |
EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
data_sources | Links to datasets used. |
metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
Column | Description |
code_block | Machine learning code block. |
too_long | Flag indicating whether the block spans multiple semantic types. |
marks | Confidence level of the annotation. |
graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code blocks in code_blocks.csv can be linked to notebook metadata in kernels_meta.csv via the kernel_id column, and notebooks can be linked to competition metadata in competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
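A minimal sketch of how these tables can be joined with pandas, assuming the CSV files are stored locally under the names given above:
import pandas as pd

# Join code blocks to their notebook and competition metadata.
code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

blocks = code_blocks.merge(kernels_meta, on="kernel_id", how="left")
blocks = blocks.merge(competitions_meta, on="comp_name", how="left")

print(blocks[["kernel_id", "comp_name", "kaggle_score"]].head())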
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and offer customers suggestions on itemsets, so we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association Rules are most often used when you want to build associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, which allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
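The toy calculation above can be reproduced in a few lines (shown in Python for brevity, although the tutorial below uses R):
# 100 customers: 10 bought a computer mouse, 9 a mouse mat, 8 bought both.
n_customers, n_mouse, n_mat, n_both = 100, 10, 9, 8

support = n_both / n_customers             # P(mouse and mat) = 0.08
confidence = n_both / n_mouse              # P(mat | mouse)   = 0.80
lift = confidence / (n_mat / n_customers)  # 0.80 / 0.09      ≈ 8.9

print(support, confidence, round(lift, 1))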
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each library is briefly described below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Top Streamers on Twitch’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aayushmishra1512/twitchdata on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Gaming is a very big industry now. Every year, millions of dollars are invested in esports, and many new companies want to invest in the esports scene. One of the biggest ever deals was when Mixer opened up and brought Ninja and Shroud to their platform from Twitch. But Twitch has been a home to streamers since day one, and now that Mixer has been shut down, streamers are returning to the platform. Millions, if not billions, watch Twitch streams every day, and I myself like to watch them. So I put together the top 1,000 streamers from the past year who were streaming on Twitch.
This data consists of attributes such as the number of viewers, the number of active viewers, followers gained, and many other relevant columns regarding a particular streamer. It has 11 columns with all the necessary information.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
U.S. Government Works: https://www.usa.gov/government-works/
This dataset is an update of Sharan Naribole's earlier dataset titled H-1B Visa Petitions 2011-2016. Inspired by his work and using a modified, updated version of his R script, I wrangled U.S. H-1B visa petition data for the years 2015-2019. The previous dataset can be found here: link to Sharan's dataset.
H-1B visas are the most common visa status applied for and held by international students once they begin working full-time in the U.S.
Please see the original dataset for more context information.
This dataset includes 5 years' worth of H-1B visa petitions in the U.S. The columns in the dataset include case status, employer name, worksite coordinates, job title, prevailing wage, occupation code, and year filed.
This file contains H-1B data from the LCA Program data files (H-1B, H-1B1, E-3). These datasets can be found on the U.S. Department of Labor site.
Shout out to Sharan Naribole for the original project idea and easy-to-update R script.
U.S. Department of Labor Data Source
Which states/cities/companies provide the most H-1B visas? For your job description, which city should you be in to have the most opportunities? Which companies should you apply to if you would like the best odds of obtaining an H-1B visa?
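A minimal pandas sketch as a starting point for these questions; the file name and the column names (CASE_STATUS, EMPLOYER_NAME, WORKSITE) are assumptions based on the column descriptions above and should be checked against the actual CSV header.
import pandas as pd

df = pd.read_csv("h1b_2015_2019.csv")
certified = df[df["CASE_STATUS"] == "CERTIFIED"]

print(certified["EMPLOYER_NAME"].value_counts().head(10))  # top employers
print(certified["WORKSITE"].value_counts().head(10))       # top worksites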
Wikipedia - Image/Caption Matching Kaggle Competition.
This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research as detailed in this SIGIR paper.
In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the WIT Kaggle dataset
ds = tfds.load('wit_kaggle', split='train')

# Print the first four examples
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wit_kaggle-train_with_extended_features-1.0.2.png
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper:
"A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
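A minimal sketch for loading MNAD v2 with pandas; the column names (Title, Body, Category, Source) follow the field descriptions above, but verify them against the actual file.
import pandas as pd

mnad = pd.read_csv("MNADv2.csv")

print(mnad["Category"].value_counts())  # articles per category (19 classes)
print(mnad["Source"].value_counts())    # articles per news source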
If you use our data, please cite the following paper:
@inproceedings{MNAD2021,
author = {Mourad Jbene and
Smail Tigani and
Rachid Saadane and
Abdellah Chehri},
title = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
year = {2021},
publisher = {{IEEE}},
booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
doi = {10.1109/dasa53625.2021.9682402},
url = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.
Key features:
1.6 million labeled tweets
Binary sentiment classification (0 for negative, 1 for positive)
Preprocessed and tokenized text
Balanced class distribution
Suitable for various NLP tasks and model architectures
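A minimal baseline sketch on the 10,000-tweet sample, assuming scikit-learn is available; the column names ("text", "target") are assumptions and should be checked against the CSV header.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("train-processed-sample.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=20_000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print("accuracy:", accuracy_score(y_test, preds))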
Citation If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of over 1 million images covering 340 classes of doodles. It contains grayscale images of doodles, organized by class, extracted from the Quick, Draw! dataset. Each image represents a hand-drawn sketch from various categories, processed to be ready for machine learning tasks.
This dataset is a clean, processed, and easy-to-use version of the original Quick, Draw! dataset by Google, which has approximately 50 million images.
This dataset is your playground for: - Training and evaluating machine learning models, especially for image classification tasks. - Conducting research and educational activities with a well-organized set of doodle images. - Benchmarking doodle recognition algorithms.
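If the images are organized into one folder per class as described, a loading sketch might look like the following; the root directory and the 28x28 image size are assumptions and should be adjusted to the actual layout.
import tensorflow as tf

ds = tf.keras.utils.image_dataset_from_directory(
    "quickdraw_images/",
    color_mode="grayscale",
    image_size=(28, 28),
    batch_size=256,
)
print(ds.class_names[:10])  # first few of the 340 doodle classes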
Special thanks to the original creators and contributors of the Quick, Draw! dataset.
This dataset is shared under the CC BY 4.0 license. Please attribute the source when using this dataset in your work.
We hope this dataset serves as a valuable resource for your projects. Happy coding and may your models achieve high accuracy!
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France between Jan-2016 and Aug-2016 were interviewed in order to have exploitable data. Therefore, this sample might not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
We arbitrarily chose the characteristics used to describe data scientists and data set sizes.
Data set size:
For the data, it uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. It might not be "free" of internal contradictions, as with all research.
I am working in the area of privacy-preserving big data publishing. The state-of-the-art approaches were tested on the Adult dataset. I found that the Adult dataset is available at the UCI repository, but a synthetic version wasn't available anywhere. As I am working with big data, I need a large amount of data to justify my contribution. Therefore, I created my own synthetic versions of the dataset with 100 thousand, 1 million, 10 million, and 100 million records. Here I am sharing the original Adult dataset with approximately 33 thousand records, along with the synthesized versions Adult100k, Adult1m, Adult10m, and Adult100m.
Adult dataset contains census information.
I would like to thank the UCI repository for providing the base dataset, without which I would not have been able to synthesize the large datasets.
The datasets might be helpful to all those who want to work on big data privacy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grammar Error Correction dataset synthesized based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
This dataset contains roughly 185 million sentence pairs generated using the C4/en/3.0.1 dataset.
The data is stored in the format:
{
"input": "This is an grammatically wrong sentences.",
"output": "This is a grammatically correct sentence."
}
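A minimal reading sketch, assuming the pairs are stored one JSON object per line; the file name is a placeholder for the actual shard names in the dataset.
import json

with open("c4_200m_gec.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        pair = json.loads(line)
        print(pair["input"], "->", pair["output"])
        if i == 4:
            break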
The C4 dataset was downloaded from allenai: https://github.com/allenai/allennlp/discussions/5056 The modified scripts used to generate the sentence pairs were referenced from: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction.
We hope this dataset will help others by saving them the trouble and time of generating it themselves.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.
Key Features:
This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
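For the anomaly-detection use case, a minimal unsupervised sketch with scikit-learn; the file name is a placeholder, and using all numeric columns is a generic default rather than the dataset's documented schema.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("bank_transactions.csv")
features = df.select_dtypes("number").fillna(0)

model = IsolationForest(contamination=0.02, random_state=0)
df["anomaly"] = model.fit_predict(features)  # -1 = flagged as anomalous

print(df["anomaly"].value_counts())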
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 250 million rows of information from the ~500 bike stations of the Barcelona public bicycle sharing service (Bicing). The data consists of time series information on the electric and mechanical bicycles available approximately every 4 minutes, from March 2019 to March 2024 (the latest available CSV file, with the idea of being updated with every new month's file). This data could inspire many different use cases, from geographical data analysis to hierarchical ML time series models or Graph Neural Networks, among others. Feel free to create a New Notebook from this page to use it and share your ideas with everyone!
Stations map: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3317928%2F64409b5bd3c220993e05f5e155fd8c25%2Fstations_map_2024.png?generation=1713725887609128&alt=media
Every month's information is stored in a separate file named {year}_{month}_STATIONS.csv. The metadata for every station has been simplified and compressed into the {year}_INFO.csv files, where there is a single entry for every station and day, in a separate file for every year.
The original data contains various errors. A few of them have already been corrected, but there are still some missing values, columns with wrong data types, and other minor artifacts or missing data. From time to time I may manually correct more of these.
The data is collected from the public BCN Open Data website, which is available to everyone (some resources require creating a free account and token):
Stations data: https://opendata-ajuntament.barcelona.cat/data/en/dataset/estat-estacions-bicing
Stations info: https://opendata-ajuntament.barcelona.cat/data/en/dataset/informacio-estacions-bicing
You can find more information in them.
Please, consider upvoting this dataset if you find it interesting! 🤗
Some observations:
The historical data for June '19 does not have data for the 20th between 7:40 am and 2:00 pm.
The historical data for July '19 does not have data from the 26th at 1:30 pm until the 29th at 10:40 am.
The historical data for November '19 may not have some data from 10:00 pm on the 26th to 11:00 am on the 27th.
The historical data for August '20 does not have data from the 7th at 2:25 am until the 10th at 10:40 am.
The historical data for November '20 does not have data on the following days/times: the 4th from 1:45 am to 11:05 am; the 20th at 7:50 pm to the 21st at 10:50 am; the 27th at 2:50 am to the 30th at 9:50 am.
The historical data for August '23 does not have data from the 22nd to the 31st due to a technical incident.
The historical data for September '23 does not have data from the 1st to the 5th due to a technical incident.
The historical data for February '24 does not have data on the 5th between 12:50 pm and 1:05 pm.
Others: due to COVID-19 measures, the Bicing service was temporarily stopped; this situation is reflected in the historical data.
Field Description:
Array of data for each station:
station_id: Identifier of the station
num_bikes_available: Number of available bikes
num_bikes_available_types: Array of types of available bikes
mechanical: Number of available mechanical bikes
ebike: Number of available electric bikes
num_docks_available: Number of available docks
is_installed: The station is properly installed (0-NO, 1-YES)
is_renting: The station is providing bikes correctly
is_returning: The station is docking bikes correctly
last_reported: Timestamp of the station information
is_charging_station: The station has electric bike charging capacity
status: Status of the station (IN_SERVICE=In service, CLOSED=Closed)
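A minimal loading sketch using the file naming pattern and fields described above; treating last_reported as a UNIX timestamp is an assumption, so verify it against the actual data.
import pandas as pd

df = pd.read_csv("2023_01_STATIONS.csv")

station = df[df["station_id"] == 1].copy()
station["last_reported"] = pd.to_datetime(station["last_reported"], unit="s")

# Hourly average of available bikes for this station
hourly = (station.set_index("last_reported")["num_bikes_available"]
          .resample("1h").mean())
print(hourly.head())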
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Stock market data can be interesting to analyze and as a further incentive, strong predictive models can have large financial payoff. The amount of financial data on the web is seemingly endless. A large and well structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.
The script I used to acquire all of these .csv files can be found in this GitHub repository In the future if you wish for a more up to date dataset, this can be used to acquire new versions of the .csv files.
The data is presented in a couple of formats to suit different individual's needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder) and a smaller version of the dataset (all_stocks_1yr.csv) with only the past year's stock data for those wishing to use something more manageable in size.
The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv and all_stocks_1yr.csv contain this same data, presented in merged .csv files. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.
All the files have the following columns:
Date - in format yy-mm-dd
Open - price of the stock at market open (this is NYSE data, so all prices are in USD)
High - highest price reached that day
Low - lowest price reached that day
Close - closing price for that day
Volume - number of shares traded
Name - the stock's ticker name
I scraped this data from Google finance using the python library 'pandas_datareader'. Special thanks to Kaggle, Github and The Market.
This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock statistics such as volatility and moving averages can easily be calculated. The million dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?
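A minimal sketch of the moving-average and volatility calculations mentioned above, using the column names listed earlier; AAPL is just an example ticker.
import pandas as pd

df = pd.read_csv("all_stocks_5yr.csv", parse_dates=["Date"])

aapl = df[df["Name"] == "AAPL"].sort_values("Date").set_index("Date")
aapl["ma_20"] = aapl["Close"].rolling(20).mean()               # 20-day moving average
aapl["vol_20"] = aapl["Close"].pct_change().rolling(20).std()  # 20-day rolling volatility

print(aapl[["Close", "ma_20", "vol_20"]].tail())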
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:
Context:
Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.
Inspiration:
The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.
Dataset Information:
The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation. Here's a brief explanation of each column in the Dataset:
Use Cases:
Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
This dataset was created by FORSEES WRITING
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
PLEASE UPVOTE IF YOU LIKE THIS CONTENT! 😍
Duolingo is an American educational technology company that produces learning apps and provides language certification. Its main app is considered the most popular language learning app in the world.
To progress in their learning journey, each user of the application needs to complete a set of lessons in which they are presented with the words of the language they want to learn. In an infinite set of lessons, each word is applied in a different context and, on top of that, Duolingo uses a spaced repetition approach, where the user sees an already known word again to reinforce their learning.
Each line in this file refers to a Duolingo lesson that had a target word to practice.
The columns are as follows:
p_recall - proportion of exercises from this lesson/practice where the word/lexeme was correctly recalled
timestamp - UNIX timestamp of the current lesson/practice
delta - time (in seconds) since the last lesson/practice that included this word/lexeme
user_id - student user ID who did the lesson/practice (anonymized)
learning_language - language being learned
ui_language - user interface language (presumably native to the student)
lexeme_id - system ID for the lexeme tag (i.e., word)
lexeme_string - lexeme tag (see below)
history_seen - total times user has seen the word/lexeme prior to this lesson/practice
history_correct - total times user has been correct for the word/lexeme prior to this lesson/practice
session_seen - times the user saw the word/lexeme during this lesson/practice
session_correct - times the user got the word/lexeme correct during this lesson/practice
The lexeme_string column contains a string representation of the "lexeme tag" used by Duolingo for each lesson/practice (data instance) in our experiments. The lexeme_string field uses the following format:
`surface-form/lemma
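A minimal sketch relating prior accuracy to in-session recall, using the columns above; the CSV file name is a placeholder for the actual file in this dataset.
import pandas as pd

df = pd.read_csv("duolingo_learning_traces.csv")
df = df[df["history_seen"] > 0]
df["history_accuracy"] = df["history_correct"] / df["history_seen"]

# Average in-session recall, bucketed by prior accuracy
print(df.groupby(pd.cut(df["history_accuracy"], bins=5))["p_recall"].mean())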