68 datasets found

New York Taxi Data 2009-2016 in Parquet Fomat
academictorrents.com
bittorrent
Updated Jul 1, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Fomat [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e
Explore at:
bittorrent(35078948106)Available download formats
Dataset updated
Jul 1, 2017
Dataset provided by
New York City Taxi and Limousine Commissionhttp://www.nyc.gov/tlc
Authors
New York Taxi and Limousine Commission
License
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Area covered
New York
Description
Trip record data from the Taxi and Limousine Commission () from January 2009-December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar
Surface Water - Habitat Results
catalog.data.gov
datasets.ai
Updated Nov 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2024). Surface Water - Habitat Results [Dataset]. https://catalog.data.gov/dataset/surface-water-habitat-results
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
California State Water Resources Control Board
Description
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
h
example-space-to-dataset-parquet
huggingface.co
Updated Nov 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DeepGHS (2024). example-space-to-dataset-parquet [Dataset]. https://huggingface.co/datasets/deepghs/example-space-to-dataset-parquet
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 26, 2024
Dataset authored and provided by
DeepGHS
Description
deepghs/example-space-to-dataset-parquet dataset hosted on Hugging Face and contributed by the HF Datasets community
h
example-space-to-dataset-parquet
huggingface.co
Updated Apr 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HannibalY (2024). example-space-to-dataset-parquet [Dataset]. https://huggingface.co/datasets/ethix/example-space-to-dataset-parquet
Explore at:
Dataset updated
Apr 14, 2024
Authors
HannibalY
Description
ethix/example-space-to-dataset-parquet dataset hosted on Hugging Face and contributed by the HF Datasets community
Datasets of the CIKM resource paper "A Semantically Enriched Mobility...
zenodo.org
zip
Updated Jun 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francesco Lettich; Francesco Lettich; Chiara Pugliese; Chiara Pugliese; Guido Rocchietti; Guido Rocchietti; Chiara Renso; Chiara Renso; Fabio PINELLI; Fabio PINELLI (2025). Datasets of the CIKM resource paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions" [Dataset]. http://doi.org/10.5281/zenodo.15658129
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15658129
Dataset updated
Jun 16, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Francesco Lettich; Francesco Lettich; Chiara Pugliese; Chiara Pugliese; Guido Rocchietti; Guido Rocchietti; Chiara Renso; Chiara Renso; Fabio PINELLI; Fabio PINELLI
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.

Input data

The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:

raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as outputted by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the sample of some trajectory, and the dataframe has the following columns:

traj_id: trajectory identifier;

user: user identifier;

lat: latitude of a trajectory sample;

lon: longitude of a trajectory sample;

time: timestamp of a sample;

pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system. outputted by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:

osmid: POI OSM identifier

element_type: POI OSM element type

name: POI native name;

name:en: POI English name;

wikidata: POI WikiData identifier;

geometry: geometry associated with the POI;

category: POI category.

social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:

tweet_ID: identifier of the post;

text: post's text;

tweet_created: post's timestamp;

uid: identifier of the user who posted.

weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 7 - Meteostat daily data downloader.ipynb our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorder in a single day, and the dataframe has the following columns:

DATE: date in which the weather observation was recorded;

TAVG_C: average temperature in celsius;

DESCRIPTION: weather conditions.

Output data: the semantically enriched Paris and New York City datasets

Tabular Representation

The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:

traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying the MAT-Builder's preprocessing step on raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time which in this dataframe has been renamed to datetime. The operations performed in the MAT-Builder's preprocessing step were:

(1) we filtered out trajectories having less than 2 samples;

(2) we filtered noisy samples inducing velocities above 300km/h:

(3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.

stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:

datetime, which indicates when a stop segments starts;

leaving_datetime, which indicates when a stop segment ends;

uid, the trajectory user's identifier;

tid, the trajectory's identifier;

lat, the stop segment's centroid latitude;

lng, the stop segment's centroid longitude.
NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).

moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample beloning to some move segment of some trajectory. The columns are:

datetime, which indicates when a sample's timestamp;

uid, the samples' trajectory user's identifier;

tid, the sample's trajectory's identifier;

lat, the sample's latitude;

lng, the sample's longitude;

move_id, the identifier of a move segment.
NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id).

enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when the are multiple POIs located nearby a stop's centroid. The columns found in this dataframe are the same from stops.parquet, plus two sets of columns.

The first set of columns concerns a stop's charachteristics:

stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;

geometry_stop, which is a Shapely Point representing a stop's centroid;

geometry, which is the aforementioned Shapely Point plus a 50 meters buffer around it.

There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:

index_poi, which is the index of the associated POI in the pois.parqet file;

osmid, which is the identifier given by OpenStreetMap to the POI;

name, the POI's name;

wikidata, the POI identifier on wikidata;

category, the POI's category;

geometry_poi, a Shapely (multi)polygon describing the POI's geometry;

distance, the distance between the stop segment's centroid and the POI.

enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics of enriched_occasional.parquet, plus the following columns:

systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;

frequency, the number of systematic stops within a systematic stop's cluster;

home, the probability that the systematic stop's cluster represents the home of the associated user;

work, the probability that the systematic stop's cluster represents the workplace of the associated user;

other,
Surface Water - Habitat Results
data.cnra.ca.gov
data.ca.gov
csv, pdf, zip
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2025). Surface Water - Habitat Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-habitat-results
Explore at:
pdf, csv, zipAvailable download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
California State Water Resources Control Board
Description
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
Surface Water - Benthic Macroinvertebrate Results
data.cnra.ca.gov
data.ca.gov
csv, pdf, zip
Updated Jun 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2025). Surface Water - Benthic Macroinvertebrate Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-benthic-macroinvertebrate-results
Explore at:
pdf, zip, csvAvailable download formats
Dataset updated
Jun 3, 2025
Dataset authored and provided by
California State Water Resources Control Board
Description
Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note bacteria single species concentrations are stored within the chemistry template, whereas abundance bacteria are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.

The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.

Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
PSYCHE-D: predicting change in depression severity using person-generated...
zenodo.org
data.niaid.nih.gov
bin, pdf
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mariko Makhmutova; Mariko Makhmutova; Raghu Kainkaryam; Raghu Kainkaryam; Marta Ferreira; Marta Ferreira; Jae Min; Jae Min; Martin Jaggi; Martin Jaggi; Ieuan Clay; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
Explore at:
pdf, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5085146
Dataset updated
Jul 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mariko Makhmutova; Mariko Makhmutova; Raghu Kainkaryam; Raghu Kainkaryam; Marta Ferreira; Marta Ferreira; Jae Min; Jae Min; Martin Jaggi; Martin Jaggi; Ieuan Clay; Ieuan Clay
Description
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

Dataset description

Parquet file, with:

35694 rows

154 columns

The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.

Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

The data subset used in this work comprises the following:

Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study

Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities

Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month

Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
P
DataSeeds.AI-Sample-Dataset-DSD Dataset
paperswithcode.com
Updated Jun 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). DataSeeds.AI-Sample-Dataset-DSD Dataset [Dataset]. https://paperswithcode.com/dataset/dataseeds-ai-sample-dataset-dsd
Explore at:
Dataset updated
Jun 5, 2025
Description
Dataset Summary The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated computer vision-ready dataset comprised of 7,772 peer-ranked, fully annotated photographic images, 350,000+ words of descriptive text, and comprehensive metadata. While the DSD is being released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.

Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.

Technical Report - Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery Github Repo - Access the complete weights and code which were used to evaluate the DSD -- https://github.com/DataSeeds-ai/DSD-finetune-blip-llava This dataset is ready for commercial/non-commercial use. Dataset Structure Size: 7,772 images (7,010 train, 762 validation) Format: Apache Parquet files for metadata, with images in JPG format Total Size: ~4.1GB Languages: English (annotations) Annotation Quality: All annotations were verified through a multi-tier human-in-the-loop process
h
pashto_speech_20k
huggingface.co
Updated Apr 24, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hanif Rahman (2025). pashto_speech_20k [Dataset]. https://huggingface.co/datasets/ihanif/pashto_speech_20k
Explore at:
Dataset updated
Apr 24, 2025
Authors
Hanif Rahman
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Pashto Synthetic Speech Dataset Parquet (20k)

This dataset contains 40000 synthetic speech recordings in the Pashto language, with 20000 male voice recordings and 20000 female voice recordings, stored in Parquet format.

Dataset Information

Dataset Size: 20000 sentences Total Recordings: 40000 audio files (20000 male + 20000 female) Audio Format: WAV, 24kHz, 16-bit PCM, embedded directly in Parquet files Dataset Format: Parquet with 500MB shards Sampling Rate: 24kHz… See the full description on the dataset page: https://huggingface.co/datasets/ihanif/pashto_speech_20k.
Z
CKW Smart Meter Data
data.niaid.nih.gov
Updated Sep 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Barahona Garzon, Braulio (2024). CKW Smart Meter Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13304498
Explore at:
Dataset updated
Sep 22, 2024
Dataset authored and provided by
Barahona Garzon, Braulio
Description
Overview

The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW publishes anonymised and aggregated data from smart meters that measure electricity consumption in canton Lucerne. This unique dataset is accessible in the ckw.ch/opendata platform.

Data set A - anonimised smart meter data

Data set B - aggregated smart meter data

Contents of this data set

This data set contains a small sample of the CKW data set A sorted per smart meter ID, stored as parquet files named with the id field of the corresponding smart meter anonymised data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet

The orginal CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as a (gzip-compressed) csv files, which are are split into one file per calendar month. The columns in the files csv are:

id: the anonymized counter ID (text)

timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)

value_kwh: the consumption in kWh in the time window under consideration (float)

In this archive, data from:

| Dateigrösse | Export Datum | Zeitraum | Dateiname || ----------- | ------------ | -------- | --------- || 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz || 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz || 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz || 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz || 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz || 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz || 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz || 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz || 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz || 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz || 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz || 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz || 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz || 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz || 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz || 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz || 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz || 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz || 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz || 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz || 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz || 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz || 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz || 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz || 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz || 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz || 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz || 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz || 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz || 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz || 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz || 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz || 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz || 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz || 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz || 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz || 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz || 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |

was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.

A small sample of all the smart meters data above, are archived in the cloud public cloud space of AISOP project https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip and also here is this public record. For access to the complete data contact the authors of this archive.

It consists of the following parquet files:

| Size | Date | Name |

|------|------|------|

| 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |

| 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |

| 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |

| 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |

| 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |

| 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |

| 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |

| 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |

| 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |

| 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |

| 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |

| 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |

| 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |

| 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |

| 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |

| 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |

| 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |

| 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |

| 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |

| 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |

| 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |

| 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |

| 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |

| 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |

| 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |

| 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |

| 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |

| 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |

| 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |

| 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |

| 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |

| 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |

| 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |

| 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |

| 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |

| 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |

| 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |

| 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |

| 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |

| 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |

| 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |

| 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |

| 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |

| 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |

| 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |

| 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |

| 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |

| 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |

| 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |

| 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |

| 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |

| 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |

| 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |

| 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |

| 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |

| 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |

| 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |

| 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |

| 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |

| 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |

| 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |

| 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |

| 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |

| 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |

| 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |

where each file contains data from a single smart meter.

Acknowledgement

The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from European Union's Horizon 2020 research and innovation programme under grant agreement No 883973. ERA-Net Smart Energy Systems joint call on digital transformation for green energy transition.
EEL mouse sagittal atlas 168 genes spatial RNA data
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lars Borm (2023). EEL mouse sagittal atlas 168 genes spatial RNA data [Dataset]. http://doi.org/10.6084/m9.figshare.20324814.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20324814.v4
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Lars Borm
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw RNA locations of the mouse atlas produced by EEL FISH for 168 genes.

RNA files are in the .parquet format which can be opened with FISHscale (https://github.com/linnarsson-lab/FISHscale) or any other parquet file reader (https://arrow.apache.org/docs/index.html)

RNA .parquet files Seven sagittal sections of the mouse brain with 168 detected genes, sampled at the medial-lateral positions of -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm and 3600 µm measured from the midline. Position and gene label for all RNA molecules. "c_px_microscope_stitched" contains X coordinates. "r_px_microscope_stitched" contians Y coordinates. The unit are pixels with a size of 0.18 micrometer. Multiply by 0.18 to get um scale. "Valid" Boolean column where a 1 means that the molecule is detected inside the tissue section. A zero means the molecule is detected outside.

Tissue polygons .csv files CSV files demarking the sample borders for the 7 mouse atlas sections. -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm, 3600 µm. These polygons were used to generate the "Valid" column. If you want to make your own selection please have a look at the code in: https://github.com/linnarsson-lab/FISHscale/blob/master/FISHscale/utils/inside_polygon.py

Gene colors .pkl file Pickled Python dictionary with gene colors used in the paper for the mouse atlas.
Surface Water - Aquatic Organism Tissue Sample Results
data.cnra.ca.gov
data.ca.gov
csv, pdf, zip
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2025). Surface Water - Aquatic Organism Tissue Sample Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-aquatic-organism-tissue-sample-results
Explore at:
pdf, csv, zipAvailable download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
California State Water Resources Control Board
Description
This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples where tissue samples from multiple organisms are combined and then analyzed. Both the individual samples and the composite sample results may be given so for individual samples, there will be a row for the individual sample and a row for the composite where the number per composite is one.

The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
d
Avery Library Historic Preservation and Urban Planning web archive...
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruest, Nick; Sala, Christine; Thurman, Alex (2023). Avery Library Historic Preservation and Urban Planning web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/Z68EVJ
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/Z68EVJ
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Ruest, Nick; Sala, Christine; Thurman, Alex
Description
Web archive derivatives of the Avery Library Historic Preservation and Urban Planning collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud. The cul-1757-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples. Domains .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) Produces a DataFrame with the following columns: domain count Web Pages .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) Produces a DataFrame with the following columns: crawl_date url mime_type_web_server mime_type_tika content Web Graph .webgraph() Produces a DataFrame with the following columns: crawl_date src dest anchor Image Links .imageLinks() Produces a DataFrame with the following columns: src image_url The cul-1757-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself. Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content. Domains count file. A text file containing the frequency count of domains captured within your web archive. Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1757-parquet.tar.gz.part* > cul-1757-parquet.tar.gz
Feature Engineering Data
kaggle.com
Updated Jul 23, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mat Leonard (2019). Feature Engineering Data [Dataset]. https://www.kaggle.com/matleonard/feature-engineering-data/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 23, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mat Leonard
Description
This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.

For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.

File descriptions

train_sample.csv - Sampled data

Data fields

Each row of the training data contains a click record, with the following features.

ip: ip address of click.

app: app id for marketing.

device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)

os: os version id of user mobile phone

channel: channel id of mobile ad publisher

click_time: timestamp of click (UTC)

attributed_time: if user download the app for after clicking an ad, this is the time of the app download

is_attributed: the target that is to be predicted, indicating the app was downloaded

Note that ip, app, device, os, and channel are encoded.

I'm also including Parquet files with various features for use within the course.
g
California State Water Resources Control Board - Surface Water - Aquatic...
gimi9.com
Updated Feb 18, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2021). California State Water Resources Control Board - Surface Water - Aquatic Organism Tissue Sample Results | gimi9.com [Dataset]. https://gimi9.com/dataset/california_surface-water-aquatic-organism-tissue-sample-results/
Explore at:
Dataset updated
Feb 18, 2021
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
California
Description
This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples where tissue samples from multiple organisms are combined and then analyzed. Both the individual samples and the composite sample results may be given so for individual samples, there will be a row for the individual sample and a row for the composite where the number per composite is one. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
Surface Water - Aquatic Organism Tissue Sample Results
catalog.data.gov
Updated Nov 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2024). Surface Water - Aquatic Organism Tissue Sample Results [Dataset]. https://catalog.data.gov/dataset/surface-water-aquatic-organism-tissue-sample-results
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
California State Water Resources Control Board
Description
This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples where tissue samples from multiple organisms are combined and then analyzed. Both the individual samples and the composite sample results may be given so for individual samples, there will be a row for the individual sample and a row for the composite where the number per composite is one. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
U
Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...
data.usgs.gov
gimi9.com
+1more
Updated Mar 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. http://doi.org/10.5066/P13DZ7MP
Explore at:
Unique identifier
https://doi.org/10.5066/P13DZ7MP
Dataset updated
Mar 3, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Authors
Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Time period covered
Mar 1, 2013 - Dec 31, 2023
Description
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files. Limitations with this dataset include: - All biases inherent to the Landsat Surface Temperature product are retained in this dataset which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody. - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each dat ...
Z
PUBG - Match Data
data.niaid.nih.gov
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joshy, Vivek (2023). PUBG - Match Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10341919
Explore at:
Dataset updated
Dec 10, 2023
Dataset authored and provided by
Joshy, Vivek
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset was generated from skihikingkevin/pubg-match-deaths. It only consists of matches where at least one player has player more than 1 game (in different matches). The data was processed using polars and converted from CSV to Parquet files. A random sample was performed (groupwise) to produce an 80/20 split.
d
University Archives web archive collection derivatives
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruest, Nick; Wilk, Jocelyn; Thurman, Alex (2023). University Archives web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/FONRZU
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/FONRZU
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Ruest, Nick; Wilk, Jocelyn; Thurman, Alex
Description
Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud. The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples. Domains .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) Produces a DataFrame with the following columns: domain count Web Pages .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) Produces a DataFrame with the following columns: crawl_date url mime_type_web_server mime_type_tika content Web Graph .webgraph() Produces a DataFrame with the following columns: crawl_date src dest anchor Image Links .imageLinks() Produces a DataFrame with the following columns: src image_url Binary Analysis Images PDFs Presentation program files Spreadsheets Text files Word processor files The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself. Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content. Domains count file. A text file containing the frequency count of domains captured within your web archive. Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz

Facebook

Twitter

Click to copy link

Link copied

Cite

New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Fomat [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e

New York Taxi Data 2009-2016 in Parquet Fomat

Explore at:

bittorrent(35078948106)Available download formats

Dataset updated

Jul 1, 2017

Dataset provided by

New York City Taxi and Limousine Commissionhttp://www.nyc.gov/tlc

Authors

New York Taxi and Limousine Commission

License

https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified

Area covered

New York

Description

Trip record data from the Taxi and Limousine Commission () from January 2009-December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar

Clear search

Close search

Google apps

Main menu

New York Taxi Data 2009-2016 in Parquet Fomat

Surface Water - Habitat Results

example-space-to-dataset-parquet

example-space-to-dataset-parquet

Datasets of the CIKM resource paper "A Semantically Enriched Mobility...

Input data

Output data: the semantically enriched Paris and New York City datasets

Tabular Representation

Surface Water - Habitat Results

Surface Water - Benthic Macroinvertebrate Results

PSYCHE-D: predicting change in depression severity using person-generated...

DataSeeds.AI-Sample-Dataset-DSD Dataset

pashto_speech_20k

CKW Smart Meter Data

EEL mouse sagittal atlas 168 genes spatial RNA data

Surface Water - Aquatic Organism Tissue Sample Results

Avery Library Historic Preservation and Urban Planning web archive...

Feature Engineering Data

File descriptions

Data fields

California State Water Resources Control Board - Surface Water - Aquatic...

Surface Water - Aquatic Organism Tissue Sample Results

Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

PUBG - Match Data

University Archives web archive collection derivatives

New York Taxi Data 2009-2016 in Parquet Fomat