75 datasets found

New York Taxi Data 2009-2016 in Parquet Fomat
academictorrents.com
bittorrent
Updated Jul 1, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Fomat [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e
Explore at:
bittorrent(35078948106)Available download formats
Dataset updated
Jul 1, 2017
Dataset provided by
New York City Taxi and Limousine Commissionhttp://www.nyc.gov/tlc
Authors
New York Taxi and Limousine Commission
License
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Area covered
New York
Description
Trip record data from the Taxi and Limousine Commission () from January 2009-December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar
Surface Water - Habitat Results
catalog.data.gov
datasets.ai
Updated Nov 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2024). Surface Water - Habitat Results [Dataset]. https://catalog.data.gov/dataset/surface-water-habitat-results
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
California State Water Resources Control Board
Description
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
Datasets of the CIKM resource paper "A Semantically Enriched Mobility...
zenodo.org
zip
Updated Jun 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Francesco Lettich; Francesco Lettich; Chiara Pugliese; Chiara Pugliese; Guido Rocchietti; Guido Rocchietti; Chiara Renso; Chiara Renso; Fabio PINELLI; Fabio PINELLI (2025). Datasets of the CIKM resource paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions" [Dataset]. http://doi.org/10.5281/zenodo.15658129
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15658129
Dataset updated
Jun 16, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Francesco Lettich; Francesco Lettich; Chiara Pugliese; Chiara Pugliese; Guido Rocchietti; Guido Rocchietti; Chiara Renso; Chiara Renso; Fabio PINELLI; Fabio PINELLI
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).

The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.

Input data

The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:

raw_trajectories_[paris|nyc]_matbuilder.parquet: these are the datasets of raw preprocessed trajectories, ready for ingestion by the MAT-Builder system, as outputted by the notebook 5 - Ensure MAT-Builder compatibility.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents the sample of some trajectory, and the dataframe has the following columns:

traj_id: trajectory identifier;

user: user identifier;

lat: latitude of a trajectory sample;

lon: longitude of a trajectory sample;

time: timestamp of a sample;

pois.parqet: these are the POI datasets, ready for ingestion by the MAT-Builder system. outputted by the notebook 6 - Generate dataset POI from OpenStreetMap.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a POI, and the dataframe has the following columns:

osmid: POI OSM identifier

element_type: POI OSM element type

name: POI native name;

name:en: POI English name;

wikidata: POI WikiData identifier;

geometry: geometry associated with the POI;

category: POI category.

social_[paris|ny].parquet: these are the social media post datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 9 - Prepare social media dataset for MAT-Builder.ipynb in our GitHub repository, saved in Parquet format. Each row in the dataframe represents a single social media post, and the dataframe has the following columns:

tweet_ID: identifier of the post;

text: post's text;

tweet_created: post's timestamp;

uid: identifier of the user who posted.

weather_conditions.parquet: these are the weather conditions datasets, ready for ingestion by the MAT-Builder system, outputted by the notebook 7 - Meteostat daily data downloader.ipynb our GitHub repository, saved in Parquet format. Each row in the dataframe represents the weather conditions recorder in a single day, and the dataframe has the following columns:

DATE: date in which the weather observation was recorded;

TAVG_C: average temperature in celsius;

DESCRIPTION: weather conditions.

Output data: the semantically enriched Paris and New York City datasets

Tabular Representation

The [paris|nyc]_output_tabular.zip zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:

traj_cleaned.parquet: parquet file storing the dataframe containing the raw preprocessed trajectories after applying the MAT-Builder's preprocessing step on raw_trajectories_[paris|nyc]_matbuilder.parquet. The dataframe contains the same columns found in raw_trajectories_[paris|nyc]_matbuilder.parquet, except for time which in this dataframe has been renamed to datetime. The operations performed in the MAT-Builder's preprocessing step were:

(1) we filtered out trajectories having less than 2 samples;

(2) we filtered noisy samples inducing velocities above 300km/h:

(3) finally, we compressed the trajectories such that all points within a radius of 20 meters from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point.

stops.parquet: parquet file storing the dataframe containing the stop segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific stop segment from some trajectory. The columns are:

datetime, which indicates when a stop segments starts;

leaving_datetime, which indicates when a stop segment ends;

uid, the trajectory user's identifier;

tid, the trajectory's identifier;

lat, the stop segment's centroid latitude;

lng, the stop segment's centroid longitude.
NOTE: to uniquely identify a stop segment, you need the triple (stop segment's index in the dataframe, uid, tid).

moves.parquet: parquet file storing the dataframe containing the samples associated with the move segments detected from the trajectories by the MAT-Builder's segmentation step. Each row in the dataframe represents a specific sample beloning to some move segment of some trajectory. The columns are:

datetime, which indicates when a sample's timestamp;

uid, the samples' trajectory user's identifier;

tid, the sample's trajectory's identifier;

lat, the sample's latitude;

lng, the sample's longitude;

move_id, the identifier of a move segment.
NOTE: to uniquely identify a move segment, you need the triple (uid, tid, move_id).

enriched_occasional.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed occasional and POIs found to be close to their centroids. As such, in this dataframe an occasional stop can appear multiple times, i.e., when the are multiple POIs located nearby a stop's centroid. The columns found in this dataframe are the same from stops.parquet, plus two sets of columns.

The first set of columns concerns a stop's charachteristics:

stop_id, which represents the unique identifier of a stop segment and corresponds to the index of said stop in stops.parquet;

geometry_stop, which is a Shapely Point representing a stop's centroid;

geometry, which is the aforementioned Shapely Point plus a 50 meters buffer around it.

There is then a second set of columns which represents the characteristics of the POI that has been associated with a stop. The relevant ones are:

index_poi, which is the index of the associated POI in the pois.parqet file;

osmid, which is the identifier given by OpenStreetMap to the POI;

name, the POI's name;

wikidata, the POI identifier on wikidata;

category, the POI's category;

geometry_poi, a Shapely (multi)polygon describing the POI's geometry;

distance, the distance between the stop segment's centroid and the POI.

enriched_systematic.parquet: parquet file storing the dataframe containing pairs representing associations between stop segments that have been deemed systematic and POIs found to be close to their centroids. This dataframe has exactly the same characteristics of enriched_occasional.parquet, plus the following columns:

systematic_id, the identifier of the cluster of systematic stops a systematic stop belongs to;

frequency, the number of systematic stops within a systematic stop's cluster;

home, the probability that the systematic stop's cluster represents the home of the associated user;

work, the probability that the systematic stop's cluster represents the workplace of the associated user;

other,
Surface Water - Benthic Macroinvertebrate Results
data.cnra.ca.gov
data.ca.gov
csv, pdf, zip
Updated Jun 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2025). Surface Water - Benthic Macroinvertebrate Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-benthic-macroinvertebrate-results
Explore at:
pdf, zip, csvAvailable download formats
Dataset updated
Jun 3, 2025
Dataset authored and provided by
California State Water Resources Control Board
Description
Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note bacteria single species concentrations are stored within the chemistry template, whereas abundance bacteria are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.

The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.

Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
Surface Water - Habitat Results
data.cnra.ca.gov
data.ca.gov
csv, pdf, zip
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California State Water Resources Control Board (2025). Surface Water - Habitat Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-habitat-results
Explore at:
pdf, csv, zipAvailable download formats
Dataset updated
Jul 3, 2025
Dataset authored and provided by
California State Water Resources Control Board
Description
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.

Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
P
DataSeeds.AI-Sample-Dataset-DSD Dataset
paperswithcode.com
Updated Jun 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). DataSeeds.AI-Sample-Dataset-DSD Dataset [Dataset]. https://paperswithcode.com/dataset/dataseeds-ai-sample-dataset-dsd
Explore at:
Dataset updated
Jun 5, 2025
Description
Dataset Summary The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated computer vision-ready dataset comprised of 7,772 peer-ranked, fully annotated photographic images, 350,000+ words of descriptive text, and comprehensive metadata. While the DSD is being released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.

Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.

Technical Report - Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery Github Repo - Access the complete weights and code which were used to evaluate the DSD -- https://github.com/DataSeeds-ai/DSD-finetune-blip-llava This dataset is ready for commercial/non-commercial use. Dataset Structure Size: 7,772 images (7,010 train, 762 validation) Format: Apache Parquet files for metadata, with images in JPG format Total Size: ~4.1GB Languages: English (annotations) Annotation Quality: All annotations were verified through a multi-tier human-in-the-loop process
PSYCHE-D: predicting change in depression severity using person-generated...
zenodo.org
data.niaid.nih.gov
bin, pdf
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mariko Makhmutova; Mariko Makhmutova; Raghu Kainkaryam; Raghu Kainkaryam; Marta Ferreira; Marta Ferreira; Jae Min; Jae Min; Martin Jaggi; Martin Jaggi; Ieuan Clay; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
Explore at:
pdf, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5085146
Dataset updated
Jul 18, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mariko Makhmutova; Mariko Makhmutova; Raghu Kainkaryam; Raghu Kainkaryam; Marta Ferreira; Marta Ferreira; Jae Min; Jae Min; Martin Jaggi; Martin Jaggi; Ieuan Clay; Ieuan Clay
Description
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

Dataset description

Parquet file, with:

35694 rows

154 columns

The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.

Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

The data subset used in this work comprises the following:

Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study

Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities

Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month

Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
h
TeXtract_augmented_v1
huggingface.co
Updated Jun 12, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ToniDO (2025). TeXtract_augmented_v1 [Dataset]. https://huggingface.co/datasets/ToniDO/TeXtract_augmented_v1
Explore at:
Dataset updated
Jun 12, 2025
Authors
ToniDO
Description
Mathematical Expressions Dataset

Dataset Description

This dataset contains images of mathematical expressions along with their corresponding LaTeX code. Images will automatically be displayed as thumbnails in Hugging Face's Data Studio.

Dataset Summary

Number of files: 1 Parquet files Estimated number of samples: 12,312 Format: Parquet optimized for Hugging Face Features configured for thumbnails: ✅ Columns: latex: LaTeX code of the mathematical expression… See the full description on the dataset page: https://huggingface.co/datasets/ToniDO/TeXtract_augmented_v1.
Z
CKW Smart Meter Data
data.niaid.nih.gov
Updated Sep 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Barahona Garzon, Braulio (2024). CKW Smart Meter Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13304498
Explore at:
Dataset updated
Sep 22, 2024
Dataset authored and provided by
Barahona Garzon, Braulio
Description
Overview

The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW publishes anonymised and aggregated data from smart meters that measure electricity consumption in canton Lucerne. This unique dataset is accessible in the ckw.ch/opendata platform.

Data set A - anonimised smart meter data

Data set B - aggregated smart meter data

Contents of this data set

This data set contains a small sample of the CKW data set A sorted per smart meter ID, stored as parquet files named with the id field of the corresponding smart meter anonymised data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet

The orginal CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as a (gzip-compressed) csv files, which are are split into one file per calendar month. The columns in the files csv are:

id: the anonymized counter ID (text)

timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)

value_kwh: the consumption in kWh in the time window under consideration (float)

In this archive, data from:

| Dateigrösse | Export Datum | Zeitraum | Dateiname || ----------- | ------------ | -------- | --------- || 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz || 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz || 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz || 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz || 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz || 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz || 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz || 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz || 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz || 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz || 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz || 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz || 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz || 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz || 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz || 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz || 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz || 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz || 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz || 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz || 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz || 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz || 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz || 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz || 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz || 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz || 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz || 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz || 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz || 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz || 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz || 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz || 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz || 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz || 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz || 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz || 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz || 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |

was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.

A small sample of all the smart meters data above, are archived in the cloud public cloud space of AISOP project https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip and also here is this public record. For access to the complete data contact the authors of this archive.

It consists of the following parquet files:

| Size | Date | Name |

|------|------|------|

| 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |

| 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |

| 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |

| 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |

| 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |

| 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |

| 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |

| 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |

| 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |

| 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |

| 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |

| 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |

| 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |

| 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |

| 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |

| 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |

| 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |

| 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |

| 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |

| 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |

| 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |

| 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |

| 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |

| 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |

| 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |

| 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |

| 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |

| 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |

| 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |

| 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |

| 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |

| 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |

| 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |

| 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |

| 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |

| 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |

| 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |

| 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |

| 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |

| 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |

| 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |

| 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |

| 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |

| 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |

| 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |

| 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |

| 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |

| 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |

| 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |

| 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |

| 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |

| 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |

| 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |

| 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |

| 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |

| 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |

| 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |

| 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |

| 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |

| 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |

| 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |

| 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |

| 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |

| 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |

| 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |

where each file contains data from a single smart meter.

Acknowledgement

The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from European Union's Horizon 2020 research and innovation programme under grant agreement No 883973. ERA-Net Smart Energy Systems joint call on digital transformation for green energy transition.
Rare Book and Manuscript Library web archive collection derivatives
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Mar 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Ruest; Nick Ruest (2020). Rare Book and Manuscript Library web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3701593
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3701593
Dataset updated
Mar 9, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nick Ruest; Nick Ruest
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web archive derivatives of the Rare Book and Manuscript Library collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The cul-2766-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Images

PDFs

Presentation program files

Spreadsheets

Word processor files

The cul-2766-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

Domains count file. A text file containing the frequency count of domains captured within your web archive.
Feature Engineering Data
kaggle.com
Updated Jul 23, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mat Leonard (2019). Feature Engineering Data [Dataset]. https://www.kaggle.com/matleonard/feature-engineering-data/metadata
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 23, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mat Leonard
Description
This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.

For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.

File descriptions

train_sample.csv - Sampled data

Data fields

Each row of the training data contains a click record, with the following features.

ip: ip address of click.

app: app id for marketing.

device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)

os: os version id of user mobile phone

channel: channel id of mobile ad publisher

click_time: timestamp of click (UTC)

attributed_time: if user download the app for after clicking an ad, this is the time of the app download

is_attributed: the target that is to be predicted, indicating the app was downloaded

Note that ip, app, device, os, and channel are encoded.

I'm also including Parquet files with various features for use within the course.
d
Avery Library Historic Preservation and Urban Planning web archive...
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruest, Nick; Sala, Christine; Thurman, Alex (2023). Avery Library Historic Preservation and Urban Planning web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/Z68EVJ
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/Z68EVJ
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Ruest, Nick; Sala, Christine; Thurman, Alex
Description
Web archive derivatives of the Avery Library Historic Preservation and Urban Planning collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud. The cul-1757-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples. Domains .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) Produces a DataFrame with the following columns: domain count Web Pages .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) Produces a DataFrame with the following columns: crawl_date url mime_type_web_server mime_type_tika content Web Graph .webgraph() Produces a DataFrame with the following columns: crawl_date src dest anchor Image Links .imageLinks() Produces a DataFrame with the following columns: src image_url The cul-1757-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself. Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content. Domains count file. A text file containing the frequency count of domains captured within your web archive. Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1757-parquet.tar.gz.part* > cul-1757-parquet.tar.gz
U
Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...
data.usgs.gov
gimi9.com
+1more
Updated Mar 3, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. http://doi.org/10.5066/P13DZ7MP
Explore at:
Unique identifier
https://doi.org/10.5066/P13DZ7MP
Dataset updated
Mar 3, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Authors
Timothy Stagnitta; Tyler King; Michael Meyer; Brendan Wakefield
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Time period covered
Mar 1, 2013 - Dec 31, 2023
Description
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files. Limitations with this dataset include: - All biases inherent to the Landsat Surface Temperature product are retained in this dataset which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody. - Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each dat ...
d
University Archives web archive collection derivatives
search.dataone.org
borealisdata.ca
Updated Dec 28, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruest, Nick; Wilk, Jocelyn; Thurman, Alex (2023). University Archives web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/FONRZU
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/FONRZU
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Ruest, Nick; Wilk, Jocelyn; Thurman, Alex
Description
Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud. The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples. Domains .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) Produces a DataFrame with the following columns: domain count Web Pages .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) Produces a DataFrame with the following columns: crawl_date url mime_type_web_server mime_type_tika content Web Graph .webgraph() Produces a DataFrame with the following columns: crawl_date src dest anchor Image Links .imageLinks() Produces a DataFrame with the following columns: src image_url Binary Analysis Images PDFs Presentation program files Spreadsheets Text files Word processor files The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself. Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content. Domains count file. A text file containing the frequency count of domains captured within your web archive. Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz
EEL mouse sagittal atlas 168 genes spatial RNA data
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lars Borm (2023). EEL mouse sagittal atlas 168 genes spatial RNA data [Dataset]. http://doi.org/10.6084/m9.figshare.20324814.v4
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.20324814.v4
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Lars Borm
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Raw RNA locations of the mouse atlas produced by EEL FISH for 168 genes.

RNA files are in the .parquet format which can be opened with FISHscale (https://github.com/linnarsson-lab/FISHscale) or any other parquet file reader (https://arrow.apache.org/docs/index.html)

RNA .parquet files Seven sagittal sections of the mouse brain with 168 detected genes, sampled at the medial-lateral positions of -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm and 3600 µm measured from the midline. Position and gene label for all RNA molecules. "c_px_microscope_stitched" contains X coordinates. "r_px_microscope_stitched" contians Y coordinates. The unit are pixels with a size of 0.18 micrometer. Multiply by 0.18 to get um scale. "Valid" Boolean column where a 1 means that the molecule is detected inside the tissue section. A zero means the molecule is detected outside.

Tissue polygons .csv files CSV files demarking the sample borders for the 7 mouse atlas sections. -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm, 3600 µm. These polygons were used to generate the "Valid" column. If you want to make your own selection please have a look at the code in: https://github.com/linnarsson-lab/FISHscale/blob/master/FISHscale/utils/inside_polygon.py

Gene colors .pkl file Pickled Python dictionary with gene colors used in the paper for the mouse atlas.

Fuτure - dataset for studies, development, and training of algorithms for...

zenodo.org

bin

Updated Oct 3, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Laurits Tani; Laurits Tani; Joosep Pata; Joosep Pata (2024). Fuτure - dataset for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons [Dataset]. http://doi.org/10.5281/zenodo.13881061

Explore at:

binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.13881061

Dataset updated

Oct 3, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Laurits Tani; Laurits Tani; Joosep Pata; Joosep Pata

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Data description

MC Simulation

The Fuτure dataset is intended for studies, development, and training of algorithms for reconstructing and identifying hadronically decaying tau leptons. The dataset is generated with Pythia 8, with the full detector simulation being performed by Geant4 with the CLIC-like detector setup CLICdet (CLIC_o3_v14) setup. Events are reconstructed using the Marlin reconstruction framework and interfaced with Key4HEP. Particle candidates in the reconstructed events are reconstructed using the PandoraPF algorithm.

In this version of the dataset no γγ -> hadrons background is included.

Samples

This dataset contains e+e- samples with Z->ττ, ZH,H->ττ and Z->qq events, with approximately 2 million events simulated in each category.

The following processes e+e- were simulated with Pythia 8 at sqrt(s) = 380 GeV:

p8_ee_qq_ecm380 [Z -> qq events]
p8_ee_ZH_Htautau [ZH -> Ztautau]
p8_ee_Z_Ztautau_ecm380 [ZH -> Ztautau]

The .root files from the MC simulation chain are eventually processed by the software found in Github in order to create flat ntuples as the final product.

Features

The basis of the ntuples are the particle flow (PF) candidates from PandoraPF. Each PF candidate has four momenta, charge and particle label (electron / muon / photon / charged hadron / neutral hadron). The PF candidates in a given event are clustered into jets using generalized kt algorithm for ee collisions, with parameters p=-1 and R=0.4. The minimum pT is set to be 0 GeV for both generator level jets and reconstructed jets. The dataset contains the four momenta of the jets, with the PF candidates in the jets with the above listed properties.

Additionally, a set of variables describing the tau lifetime are calculated using the software in Github. As tau lifetime is very short, these variables are sensitive to true tau decays. In the calculation of these lifetime variables, we use a linear approximation.

In summary, the features found in the flat ntuples are:

Name	Description
reco_cand_p4s	4-momenta per particle in the reco jet.
reco_cand_charge	Charge per particle in the jet.
reco_cand_pdg	PDGid per particle in the jet.
reco_jet_p4s	RecoJet 4-momenta.
reco_cand_dz	Longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dz_err	Uncertainty of the longitudinal impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dxy	Transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
reco_cand_dxy_err	Uncertainty of the transverse impact parameter per particle in the jet. For future steps. Fill value used for neutral particles as no track parameters can be calculated.
gen_jet_p4s	GenJet 4-momenta. Matched with RecoJet within a cone of radius dR < 0.3.
gen_jet_tau_decaymode	Decay mode of the associated genTau. Jets that have associated leptonically decaying taus are removed, so there are no DM=16 jets. If no GenTau can be matched to GenJet within dR < 0.4, a fill value is used.
gen_jet_tau_p4s	Visible 4-momenta of the genTau. If no GenTau can be matched to GenJet within dR<0.4, a fill value is used.

The ground truth is based on stable particles at the generator level, before detector simulation. These particles are clustered into generator-level jets and are matched to generator-level τ leptons as well as reconstructed jets. In order for a generator-level jet to be matched to generator-level τ lepton, the τ lepton needs to be inside a cone of dR = 0.4. The same applies for the reconstructed jet, with the requirement on dR being set to dR = 0.3. For each reconstructed jet, we define three target values related to τ lepton reconstruction:

a binary flag isTau if it was matched to a generator-level hadronically decaying τ lepton. gen_jet_tau_decaymode of value -1 indicates no match to generator-level hadronically decaying τ.
the categorical decay mode of the τ gen_jet_tau_decaymode in terms of the number of generator level charged and neutral hadrons. Possible gen_jet_tau_decaymode are {0, 1, . . . , 15}.
if matched, the visible (neglecting neutrinos), reconstructable pT of the τ lepton. This is inferred from the gen_jet_tau_p4s

Dataset characteristics

File	# Jets	Size
z_test.parquet	870 843	171 MB
z_train.parquet	3 483 369	681 MB
zh_test.parquet	1 068 606	213 MB
zh_train.parquet	4 274 423	851 MB
qq_test.parquet	6 366 715	1.4 GB
qq_train.parquet	25 466 858	5.6 GB

The dataset consists of 6 files of 8.9 GB in total.

How can you use these data?

The .parquet files can be directly loaded with the Awkward Array Python library.
An example how one might use the dataset and the features is given in data_intro.ipynb

o
GitTables 1M
explore.openaire.eu
Updated May 3, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M [Dataset]. http://doi.org/10.5281/zenodo.6517052
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6517052
Dataset updated
May 3, 2022
Authors
Madelon Hulsebos; Çağatay Demiralp; Paul Groth
Description
Summary GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories, that are associated with a license that allows distribution. We aim to grow this to at least 10M tables. Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions. We believe GitTables can facilitate many use-cases, among which: Data integration, search and validation. Data visualization and analysis recommendation. Schema analysis and completion for e.g. database or knowledge base design. If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets). Dataset contents The data is provided in subsets of tables stored in parquet files, each subset corresponds to a term that was used to query GitHub with. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This version corresponds to this version of the paper https://arxiv.org/abs/2106.07258v4. In summary, this dataset can be characterized as follows: Statistic Value # tables 1M average # columns 12 average # rows 142 # annotated tables (at least 1 column annotation) 723K+ (DBpedia), 738K+ (Schema.org) # unique semantic types 835 (DBpedia), 677 (Schema.org) How to download The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download). Future releases Future releases will include the following: Increased number of tables (expected at least 10M) Associated datasets - GitTables benchmark - column type detection: https://zenodo.org/record/5706316 - GitTables 1M - CSV files: https://zenodo.org/record/6515973
d
HERO WEC 2024 Hydraulic Configuration Deployment Data
catalog.data.gov
mhkdr.openei.org
+1more
Updated Jan 20, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Renewable Energy Laboratory (2025). HERO WEC 2024 Hydraulic Configuration Deployment Data [Dataset]. https://catalog.data.gov/dataset/hero-wec-2024-hydraulic-configuration-deployment-data-501bc
Explore at:
Dataset updated
Jan 20, 2025
Dataset provided by
National Renewable Energy Laboratory
Description
The following submission includes raw and processed data from the in water deployment of NREL's Hydraulic and Electric Reverse Osmosis Wave Energy Converter (HERO WEC), in the form of parquet files, TDMS files, CSV files, bag files and MATLAB workspaces. This dataset was collected in March 2024 at the Jennette's pier test site in North Carolina. This submission includes the following: Data description document (HERO WEC FY24 Hydraulic Deployment Data Descriptions.doc) - This document includes detailed descriptions of the type of data and how it was processed and/or calculated. Processed MATLAB workspace - The processed data is provided in the form of a single MATLAB workspace containing data from the full deployment. This workspace contains data from all sensors down sampled to 10 Hz along with all array Value Added Products (VAPs). MATLAB visualization scripts - The MATLAB workspaces can be visualized using the file "HERO_WEC_2024_Hydraulic_Config_Data_Viewer.m/mlx". The user simply needs to download the processed MATLAB workspaces, specify the desired start and end times and run this file. Both the .m and .mlx file format has been provided depending on the user's preference. Summary Data - The fully processed data was used to create a summary data set with averages and important calculations performed on 30-minute intervals to align with the intervals of wave resource data reported from nearby CDIP ocean observing buoys located 20km East of Jennette's pier and 40km Northeast of Jennette's pier. The wave resource data provided in this data set is to be used for reference only due the difference in water depth and proximity to shore between the Jennette's pier test site and the locations of the ocean observing buoys. This data is provided in the Summary Data zip folder, which includes this data set in the form of a MATLAB workspace, parquet file, and excel spreadsheet. Processed Parquet File - The processed data is provided in the form of a single parquet file containing data from all HERO WEC sensors collected during the full deployment. Data in these files has been down sampled to 10 Hz and all array VAPs are included. Interim Filtered Data - Raw data from each sensor group partitioned into 30-minute parquet files. These files are outputs from an intermediate stage of data processing and contain the raw data with no Quality Control (QC) or calculations performed in a format that is easier to use than the raw data. Raw Data - Raw, unprocessed data from this deployment can be found in the Raw Data zip folder. This data is provided in the form of TDMS, CSV, and bag files in the original format output by the MODAQ system. Python Data Processing Script - This links to an NREL public github repository containing the python script used to go from raw data to fully processed parquet files. Additional documentation on how to use this script is included in the github repository. This data set has been developed by the National Renewable Energy Laboratory, operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. Funding provided by the U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Water Power Technologies Office.
Freely Accessible eJournals web archive collection derivatives
zenodo.org
explore.openaire.eu
+1more
application/gzip
Updated Feb 2, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Ruest; Nick Ruest (2020). Freely Accessible eJournals web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3633671
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3633671
Dataset updated
Feb 2, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nick Ruest; Nick Ruest
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Word processor files

The cul-12143-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

Domains count file. A text file containing the frequency count of domains captured within your web archive.
Geologic Field Trip Guidebooks Web Archive collection derivatives
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Feb 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Ruest; Nick Ruest; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams (2020). Geologic Field Trip Guidebooks Web Archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3666295
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3666295
Dataset updated
Feb 13, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nick Ruest; Nick Ruest; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams; Amanda Bielskas; Brittany Wofford; Jane Quigley; Emily Wild; Samantha Abrams
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web archive derivatives of the collection Geologic Field Trip Guidebooks Web Archive from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The ivy-12576-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Word processor files

The ivy-12576-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

Domains count file. A text file containing the frequency count of domains captured within your web archive.

Facebook

Twitter

Click to copy link

Link copied

Cite

New York Taxi and Limousine Commission (2017). New York Taxi Data 2009-2016 in Parquet Fomat [Dataset]. https://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e

New York Taxi Data 2009-2016 in Parquet Fomat

Explore at:

bittorrent(35078948106)Available download formats

Dataset updated

Jul 1, 2017

Dataset provided by

New York City Taxi and Limousine Commissionhttp://www.nyc.gov/tlc

Authors

New York Taxi and Limousine Commission

License

https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified

Area covered

New York

Description

Trip record data from the Taxi and Limousine Commission () from January 2009-December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar

Clear search

Close search

Google apps

Main menu

New York Taxi Data 2009-2016 in Parquet Fomat

Surface Water - Habitat Results

Datasets of the CIKM resource paper "A Semantically Enriched Mobility...

Input data

Output data: the semantically enriched Paris and New York City datasets

Tabular Representation

Surface Water - Benthic Macroinvertebrate Results

Surface Water - Habitat Results

DataSeeds.AI-Sample-Dataset-DSD Dataset

PSYCHE-D: predicting change in depression severity using person-generated...

TeXtract_augmented_v1

CKW Smart Meter Data

Rare Book and Manuscript Library web archive collection derivatives

Feature Engineering Data

File descriptions

Data fields

Avery Library Historic Preservation and Urban Planning web archive...

Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8...

University Archives web archive collection derivatives

EEL mouse sagittal atlas 168 genes spatial RNA data

Fuτure - dataset for studies, development, and training of algorithms for...

Data description

MC Simulation

Samples

Features

Contents:

Dataset characteristics

How can you use these data?

GitTables 1M

HERO WEC 2024 Hydraulic Configuration Deployment Data

Freely Accessible eJournals web archive collection derivatives

Geologic Field Trip Guidebooks Web Archive collection derivatives

New York Taxi Data 2009-2016 in Parquet Fomat