https://academictorrents.com/nolicensespecified
Trip record data from the Taxi and Limousine Commission (TLC) covering January 2009 through December 2016 was consolidated and brought into a consistent Parquet format by Ravi Shekhar.
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
The deepghs/example-space-to-dataset-parquet dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
The ethix/example-space-to-dataset-parquet dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the two semantically enriched trajectory datasets introduced in the CIKM Resource Paper "A Semantically Enriched Mobility Dataset with Contextual and Social Dimensions", by Chiara Pugliese (CNR-IIT), Francesco Lettich (CNR-ISTI), Guido Rocchietti (CNR-ISTI), Chiara Renso (CNR-ISTI), and Fabio Pinelli (IMT Lucca, CNR-ISTI).
The two datasets were generated with an open source pipeline based on the Jupyter notebooks published in the GitHub repository behind our resource paper, and our MAT-Builder system. Overall, our pipeline first generates the files that we provide in the [paris|nyc]_input_matbuilder.zip archives; the files are then passed as input to the MAT-Builder system, which ultimately generates the two semantically enriched trajectory datasets for Paris and New York City, both in tabular and RDF formats. For more details on the input and output data, please see the sections below.
The [paris|nyc]_input_matbuilder.zip archives contain the data sources we used with the MAT-Builder system to semantically enrich raw preprocessed trajectories. More specifically, the archives contain the following files:
The [paris|nyc]_output_tabular.zip archives contain the output files generated by MAT-Builder that express the semantically enriched Paris and New York City datasets in tabular format. More specifically, they contain the following files:
A second set of columns represents the characteristics of the POI that has been associated with a stop. The relevant ones are:
This data provides results from field analyses, from the California Environmental Data Exchange Network (CEDEN). The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
Data collected for marine benthic infauna, freshwater benthic macroinvertebrate (BMI), algae, bacteria and diatom taxonomic analyses, from the California Environmental Data Exchange Network (CEDEN). Note that single-species bacteria concentrations are stored within the chemistry template, whereas bacteria abundance data are stored within this set. Each record represents a result from a specific event location for a single organism in a single sample.
The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Zip files are provided for bulk data downloads (in csv or parquet file format), and developers can use the API associated with the "CEDEN Benthic Data" (csv) resource to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
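The resource APIs mentioned above are exposed through the data.ca.gov portal; below is a minimal, hedged sketch of pulling a few records, assuming the portal's standard CKAN datastore_search endpoint and using a placeholder resource ID that must be replaced with the one listed on the resource's API page.

```python
# Hedged sketch: query a CEDEN resource through the portal's CKAN datastore API.
# The endpoint layout is the standard CKAN one (an assumption about this portal),
# and RESOURCE_ID is a placeholder to be copied from the resource's API page.
import requests

BASE_URL = "https://data.ca.gov/api/3/action/datastore_search"
RESOURCE_ID = "REPLACE-WITH-RESOURCE-ID"  # placeholder, not a real ID

resp = requests.get(BASE_URL, params={"resource_id": RESOURCE_ID, "limit": 100})
resp.raise_for_status()
records = resp.json()["result"]["records"]
print(f"fetched {len(records)} rows")
```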
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
Parquet file, with:
The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
The file contains an aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a year-long longitudinal study of 10,036 individuals in the United States who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes between January 2018 and January 2020.
The data subset used in this work comprises the following:
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows for each month of data collection from the participants. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
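As a minimal sketch of working with the index convention described above (the filename is hypothetical; the "<participant>_<month>" pattern, e.g. "34_12", is documented above), the participant ID and study month can be recovered by splitting the index:

```python
# Minimal sketch, assuming the feature matrix is saved locally as
# "discover_features.parquet" (hypothetical filename) and the index follows
# the documented "<participant>_<month>" pattern, e.g. "34_12".
import pandas as pd

df = pd.read_parquet("discover_features.parquet")

parts = df.index.to_series().str.split("_", expand=True)
df["participant"] = parts[0].astype(int)
df["month"] = parts[1].astype(int)

# Months of data available per participant, e.g. to see how many
# non-overlapping 3-month samples could be formed.
print(df.groupby("participant")["month"].nunique().describe())
```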
Dataset Summary The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated computer vision-ready dataset comprised of 7,772 peer-ranked, fully annotated photographic images, 350,000+ words of descriptive text, and comprehensive metadata. While the DSD is being released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.
Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.
Technical Report - Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery
Github Repo - Access the complete weights and code which were used to evaluate the DSD: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
This dataset is ready for commercial/non-commercial use.
Dataset Structure
- Size: 7,772 images (7,010 train, 762 validation)
- Format: Apache Parquet files for metadata, with images in JPG format
- Total Size: ~4.1GB
- Languages: English (annotations)
- Annotation Quality: All annotations were verified through a multi-tier human-in-the-loop process
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Pashto Synthetic Speech Dataset Parquet (20k)
This dataset contains 40000 synthetic speech recordings in the Pashto language, with 20000 male voice recordings and 20000 female voice recordings, stored in Parquet format.
Dataset Information
- Dataset Size: 20000 sentences
- Total Recordings: 40000 audio files (20000 male + 20000 female)
- Audio Format: WAV, 24kHz, 16-bit PCM, embedded directly in Parquet files
- Dataset Format: Parquet with 500MB shards
- Sampling Rate: 24kHz
… See the full description on the dataset page: https://huggingface.co/datasets/ihanif/pashto_speech_20k.
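Since the recordings are embedded in Parquet shards on the Hugging Face Hub, a hedged sketch of loading them with the `datasets` library is shown below; the split name and column layout are assumptions to be checked against the dataset page.

```python
# Hedged sketch: load the Parquet shards from the Hugging Face Hub.
# The split name ("train") and the exact column layout are assumptions;
# consult the dataset page linked above for the actual schema.
from datasets import load_dataset

ds = load_dataset("ihanif/pashto_speech_20k", split="train")
print(ds)            # row count and column names
sample = ds[0]       # embedded WAV bytes are decoded on access if typed as Audio
print(sample.keys())
```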
Overview
The CKW Group is a distribution system operator that supplies more than 200,000 end customers in Central Switzerland. Since October 2022, CKW has published anonymised and aggregated data from smart meters that measure electricity consumption in the canton of Lucerne. This unique dataset is accessible on the ckw.ch/opendata platform.
Data set A - anonymised smart meter data
Data set B - aggregated smart meter data
Contents of this data set
This data set contains a small sample of the CKW data set A, sorted per smart meter ID and stored as parquet files named after the id field of the corresponding anonymised smart meter data. Example: 027ceb7b8fd77a4b11b3b497e9f0b174.parquet
The original CKW data is available for download at https://open.data.axpo.com/%24web/index.html#dataset-a as gzip-compressed csv files, split into one file per calendar month. The columns in the csv files are:
id: the anonymized counter ID (text)
timestamp: the UTC time at the beginning of a 15-minute time window to which the consumption refers (ISO-8601 timestamp)
value_kwh: the consumption in kWh in the time window under consideration (float)
In this archive, data from:
| File size | Export date | Period | File name |
| ----------- | ------------ | -------- | --------- |
| 4.2GiB | 2024-04-20 | 202402 | ckw_opendata_smartmeter_dataset_a_202402.csv.gz |
| 4.5GiB | 2024-03-21 | 202401 | ckw_opendata_smartmeter_dataset_a_202401.csv.gz |
| 4.5GiB | 2024-02-20 | 202312 | ckw_opendata_smartmeter_dataset_a_202312.csv.gz |
| 4.4GiB | 2024-01-20 | 202311 | ckw_opendata_smartmeter_dataset_a_202311.csv.gz |
| 4.5GiB | 2023-12-20 | 202310 | ckw_opendata_smartmeter_dataset_a_202310.csv.gz |
| 4.4GiB | 2023-11-20 | 202309 | ckw_opendata_smartmeter_dataset_a_202309.csv.gz |
| 4.5GiB | 2023-10-20 | 202308 | ckw_opendata_smartmeter_dataset_a_202308.csv.gz |
| 4.6GiB | 2023-09-20 | 202307 | ckw_opendata_smartmeter_dataset_a_202307.csv.gz |
| 4.4GiB | 2023-08-20 | 202306 | ckw_opendata_smartmeter_dataset_a_202306.csv.gz |
| 4.6GiB | 2023-07-20 | 202305 | ckw_opendata_smartmeter_dataset_a_202305.csv.gz |
| 3.3GiB | 2023-06-20 | 202304 | ckw_opendata_smartmeter_dataset_a_202304.csv.gz |
| 4.6GiB | 2023-05-24 | 202303 | ckw_opendata_smartmeter_dataset_a_202303.csv.gz |
| 4.2GiB | 2023-04-20 | 202302 | ckw_opendata_smartmeter_dataset_a_202302.csv.gz |
| 4.7GiB | 2023-03-20 | 202301 | ckw_opendata_smartmeter_dataset_a_202301.csv.gz |
| 4.6GiB | 2023-03-15 | 202212 | ckw_opendata_smartmeter_dataset_a_202212.csv.gz |
| 4.3GiB | 2023-03-15 | 202211 | ckw_opendata_smartmeter_dataset_a_202211.csv.gz |
| 4.4GiB | 2023-03-15 | 202210 | ckw_opendata_smartmeter_dataset_a_202210.csv.gz |
| 4.3GiB | 2023-03-15 | 202209 | ckw_opendata_smartmeter_dataset_a_202209.csv.gz |
| 4.4GiB | 2023-03-15 | 202208 | ckw_opendata_smartmeter_dataset_a_202208.csv.gz |
| 4.4GiB | 2023-03-15 | 202207 | ckw_opendata_smartmeter_dataset_a_202207.csv.gz |
| 4.2GiB | 2023-03-15 | 202206 | ckw_opendata_smartmeter_dataset_a_202206.csv.gz |
| 4.3GiB | 2023-03-15 | 202205 | ckw_opendata_smartmeter_dataset_a_202205.csv.gz |
| 4.2GiB | 2023-03-15 | 202204 | ckw_opendata_smartmeter_dataset_a_202204.csv.gz |
| 4.1GiB | 2023-03-15 | 202203 | ckw_opendata_smartmeter_dataset_a_202203.csv.gz |
| 3.5GiB | 2023-03-15 | 202202 | ckw_opendata_smartmeter_dataset_a_202202.csv.gz |
| 3.7GiB | 2023-03-15 | 202201 | ckw_opendata_smartmeter_dataset_a_202201.csv.gz |
| 3.5GiB | 2023-03-15 | 202112 | ckw_opendata_smartmeter_dataset_a_202112.csv.gz |
| 3.1GiB | 2023-03-15 | 202111 | ckw_opendata_smartmeter_dataset_a_202111.csv.gz |
| 3.0GiB | 2023-03-15 | 202110 | ckw_opendata_smartmeter_dataset_a_202110.csv.gz |
| 2.7GiB | 2023-03-15 | 202109 | ckw_opendata_smartmeter_dataset_a_202109.csv.gz |
| 2.6GiB | 2023-03-15 | 202108 | ckw_opendata_smartmeter_dataset_a_202108.csv.gz |
| 2.4GiB | 2023-03-15 | 202107 | ckw_opendata_smartmeter_dataset_a_202107.csv.gz |
| 2.1GiB | 2023-03-15 | 202106 | ckw_opendata_smartmeter_dataset_a_202106.csv.gz |
| 2.0GiB | 2023-03-15 | 202105 | ckw_opendata_smartmeter_dataset_a_202105.csv.gz |
| 1.7GiB | 2023-03-15 | 202104 | ckw_opendata_smartmeter_dataset_a_202104.csv.gz |
| 1.6GiB | 2023-03-15 | 202103 | ckw_opendata_smartmeter_dataset_a_202103.csv.gz |
| 1.3GiB | 2023-03-15 | 202102 | ckw_opendata_smartmeter_dataset_a_202102.csv.gz |
| 1.3GiB | 2023-03-15 | 202101 | ckw_opendata_smartmeter_dataset_a_202101.csv.gz |
was processed into partitioned parquet files, and then organised by id into parquet files with data from single smart meters.
A small sample of all the smart meter data above is archived in the public cloud space of the AISOP project at https://os.zhdk.cloud.switch.ch/swift/v1/aisop_public/ckw/ts/batch_0424/batch_0424.zip and also in this public record. For access to the complete data, contact the authors of this archive.
It consists of the following parquet files:
| Size | Date | Name |
|------|------|------|
| 1.0M | Mar 4 12:18 | 027ceb7b8fd77a4b11b3b497e9f0b174.parquet |
| 979K | Mar 4 12:18 | 03a4af696ff6a5c049736e9614f18b1b.parquet |
| 1.0M | Mar 4 12:18 | 03654abddf9a1b26f5fbbeea362a96ed.parquet |
| 1.0M | Mar 4 12:18 | 03acebcc4e7d39b6df5c72e01a3c35a6.parquet |
| 1.0M | Mar 4 12:18 | 039e60e1d03c2afd071085bdbd84bb69.parquet |
| 931K | Mar 4 12:18 | 036877a1563f01e6e830298c193071a6.parquet |
| 1.0M | Mar 4 12:18 | 02e45872f30f5a6a33972e8c3ba9c2e5.parquet |
| 662K | Mar 4 12:18 | 03a25f298431549a6bc0b1a58eca1f34.parquet |
| 635K | Mar 4 12:18 | 029a46275625a3cefc1f56b985067d15.parquet |
| 1.0M | Mar 4 12:18 | 0301309d6d1e06c60b4899061deb7abd.parquet |
| 1.0M | Mar 4 12:18 | 0291e323d7b1eb76bf680f6e800c2594.parquet |
| 1.0M | Mar 4 12:18 | 0298e58930c24010bbe2777c01b7644a.parquet |
| 1.0M | Mar 4 12:18 | 0362c5f3685febf367ebea62fbc88590.parquet |
| 1.0M | Mar 4 12:18 | 0390835d05372cb66f6cd4ca662399e8.parquet |
| 1.0M | Mar 4 12:18 | 02f670f059e1f834dfb8ba809c13a210.parquet |
| 987K | Mar 4 12:18 | 02af749aaf8feb59df7e78d5e5d550e0.parquet |
| 996K | Mar 4 12:18 | 0311d3c1d08ee0af3edda4dc260421d1.parquet |
| 1.0M | Mar 4 12:18 | 030a707019326e90b0ee3f35bde666e0.parquet |
| 955K | Mar 4 12:18 | 033441231b277b283191e0e1194d81e2.parquet |
| 995K | Mar 4 12:18 | 0317b0417d1ec91b5c243be854da8a86.parquet |
| 1.0M | Mar 4 12:18 | 02ef4e49b6fb50f62a043fb79118d980.parquet |
| 1.0M | Mar 4 12:18 | 0340ad82e9946be45b5401fc6a215bf3.parquet |
| 974K | Mar 4 12:18 | 03764b3b9a65886c3aacdbc85d952b19.parquet |
| 1.0M | Mar 4 12:18 | 039723cb9e421c5cbe5cff66d06cb4b6.parquet |
| 1.0M | Mar 4 12:18 | 0282f16ed6ef0035dc2313b853ff3f68.parquet |
| 1.0M | Mar 4 12:18 | 032495d70369c6e64ab0c4086583bee2.parquet |
| 900K | Mar 4 12:18 | 02c56641571fc9bc37448ce707c80d3d.parquet |
| 1.0M | Mar 4 12:18 | 027b7b950689c337d311094755697a8f.parquet |
| 1.0M | Mar 4 12:18 | 02af272adccf45b6cdd4a7050c979f9f.parquet |
| 927K | Mar 4 12:18 | 02fc9a3b2b0871d3b6a1e4f8fe415186.parquet |
| 1.0M | Mar 4 12:18 | 03872674e2a78371ce4dfa5921561a8c.parquet |
| 881K | Mar 4 12:18 | 0344a09d90dbfa77481c5140bb376992.parquet |
| 1.0M | Mar 4 12:18 | 0351503e2b529f53bdae15c7fbd56fc0.parquet |
| 1.0M | Mar 4 12:18 | 033fe9c3a9ca39001af68366da98257c.parquet |
| 1.0M | Mar 4 12:18 | 02e70a1c64bd2da7eb0d62be870ae0d6.parquet |
| 1.0M | Mar 4 12:18 | 0296385692c9de5d2320326eaa000453.parquet |
| 962K | Mar 4 12:18 | 035254738f1cc8a31075d9fbe3ec2132.parquet |
| 991K | Mar 4 12:18 | 02e78f0d6a8fb96050053e188bf0f07c.parquet |
| 1.0M | Mar 4 12:18 | 039e4f37ed301110f506f551482d0337.parquet |
| 961K | Mar 4 12:18 | 039e2581430703b39c359dc62924a4eb.parquet |
| 999K | Mar 4 12:18 | 02c6f7e4b559a25d05b595cbb5626270.parquet |
| 1.0M | Mar 4 12:18 | 02dd91468360700a5b9514b109afb504.parquet |
| 938K | Mar 4 12:18 | 02e99c6bb9d3ca833adec796a232bac0.parquet |
| 589K | Mar 4 12:18 | 03aef63e26a0bdbce4a45d7cf6f0c6f8.parquet |
| 1.0M | Mar 4 12:18 | 02d1ca48a66a57b8625754d6a31f53c7.parquet |
| 1.0M | Mar 4 12:18 | 03af9ebf0457e1d451b83fa123f20a12.parquet |
| 1.0M | Mar 4 12:18 | 0289efb0e712486f00f52078d6c64a5b.parquet |
| 1.0M | Mar 4 12:18 | 03466ed913455c281ffeeaa80abdfff6.parquet |
| 1.0M | Mar 4 12:18 | 032d6f4b34da58dba02afdf5dab3e016.parquet |
| 1.0M | Mar 4 12:18 | 03406854f35a4181f4b0778bb5fc010c.parquet |
| 1.0M | Mar 4 12:18 | 0345fc286238bcea5b2b9849738c53a2.parquet |
| 1.0M | Mar 4 12:18 | 029ff5169155b57140821a920ad67c7e.parquet |
| 985K | Mar 4 12:18 | 02e4c9f3518f079ec4e5133acccb2635.parquet |
| 1.0M | Mar 4 12:18 | 03917c4f2aef487dc20238777ac5fdae.parquet |
| 969K | Mar 4 12:18 | 03aae0ab38cebcb160e389b2138f50da.parquet |
| 914K | Mar 4 12:18 | 02bf87b07b64fb5be54f9385880b9dc1.parquet |
| 1.0M | Mar 4 12:18 | 02776685a085c4b785a3885ef81d427a.parquet |
| 947K | Mar 4 12:18 | 02f5a82af5a5ffac2fe7551bf4a0a1aa.parquet |
| 992K | Mar 4 12:18 | 039670174dbc12e1ae217764c96bbeb3.parquet |
| 1.0M | Mar 4 12:18 | 037700bf3e272245329d9385bb458bac.parquet |
| 602K | Mar 4 12:18 | 0388916cdb86b12507548b1366554e16.parquet |
| 939K | Mar 4 12:18 | 02ccbadea8d2d897e0d4af9fb3ed9a8e.parquet |
| 1.0M | Mar 4 12:18 | 02dc3f4fb7aec02ba689ad437d8bc459.parquet |
| 1.0M | Mar 4 12:18 | 02cf12e01cd20d38f51b4223e53d3355.parquet |
| 993K | Mar 4 12:18 | 0371f79d154c00f9e3e39c27bab2b426.parquet |
where each file contains data from a single smart meter.
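As a minimal sketch (assuming the per-meter parquet files keep the id, timestamp and value_kwh columns of the original CSVs), one of the files listed above can be read and aggregated to daily consumption like this:

```python
# Minimal sketch, assuming each per-meter parquet file keeps the original
# columns (id, timestamp, value_kwh); adjust if the partitioned files differ.
import pandas as pd

df = pd.read_parquet("027ceb7b8fd77a4b11b3b497e9f0b174.parquet")
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Sum the 15-minute readings into daily consumption in kWh.
daily_kwh = df.set_index("timestamp")["value_kwh"].resample("1D").sum()
print(daily_kwh.head())
```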
Acknowledgement
The AISOP project (https://aisopproject.com/) received funding in the framework of the Joint Programming Platform Smart Energy Systems from the European Union's Horizon 2020 research and innovation programme under grant agreement No 883973 (ERA-Net Smart Energy Systems joint call on digital transformation for the green energy transition).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Raw RNA locations of the mouse atlas produced by EEL FISH for 168 genes.
RNA files are in the .parquet format which can be opened with FISHscale (https://github.com/linnarsson-lab/FISHscale) or any other parquet file reader (https://arrow.apache.org/docs/index.html)
RNA .parquet files
Seven sagittal sections of the mouse brain with 168 detected genes, sampled at the medial-lateral positions of -140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm and 3600 µm measured from the midline. The files contain the position and gene label for all RNA molecules. "c_px_microscope_stitched" contains X coordinates and "r_px_microscope_stitched" contains Y coordinates; the units are pixels with a size of 0.18 micrometer, so multiply by 0.18 to get the µm scale. "Valid" is a Boolean column where 1 means the molecule was detected inside the tissue section and 0 means it was detected outside.
Tissue polygons .csv files
CSV files demarcating the sample borders for the 7 mouse atlas sections (-140 µm, 600 µm, 1200 µm, 1810 µm, 2420 µm, 3000 µm, 3600 µm). These polygons were used to generate the "Valid" column. If you want to make your own selection, please have a look at the code in: https://github.com/linnarsson-lab/FISHscale/blob/master/FISHscale/utils/inside_polygon.py
Gene colors .pkl file
Pickled Python dictionary with the gene colors used in the paper for the mouse atlas.
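A minimal sketch of reading one RNA section with a generic parquet reader (the filename is hypothetical; the coordinate and "Valid" columns are as documented above):

```python
# Minimal sketch: read one RNA .parquet section without FISHscale.
# The filename is a placeholder; column names follow the description above.
import pandas as pd

rna = pd.read_parquet("section_600um.parquet")

# Convert stitched pixel coordinates to micrometres (0.18 um per pixel).
rna["x_um"] = rna["c_px_microscope_stitched"] * 0.18
rna["y_um"] = rna["r_px_microscope_stitched"] * 0.18

# Keep only molecules detected inside the tissue section.
inside = rna[rna["Valid"] == 1]
print(len(rna), "molecules total,", len(inside), "inside the tissue")
```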
This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples where tissue samples from multiple organisms are combined and then analyzed. Both the individual and the composite sample results may be given, so for individual samples there will be a row for the individual sample and a row for the composite, where the number per composite is one.
The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.
Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.
Web archive derivatives of the Avery Library Historic Preservation and Urban Planning collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The cul-1757-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns: domain, count

Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content

Web Graph
.webgraph()
Produces a DataFrame with the following columns: crawl_date, src, dest, anchor

Image Links
.imageLinks()
Produces a DataFrame with the following columns: src, image_url

The cul-1757-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:
- Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
- Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
- Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
- Domains count file. A text file containing the frequency count of domains captured within your web archive.

Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1757-parquet.tar.gz.part* > cul-1757-parquet.tar.gz
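As a hedged sketch (the directory name inside the unpacked archive is an assumption; the column names come from the derivative description above), the web pages derivative can be loaded into pandas as follows:

```python
# Hedged sketch: read the web pages parquet derivative after reassembling and
# unpacking cul-1757-parquet.tar.gz. The "webpages" directory name is an
# assumption about the archive layout; column names follow the description above.
import pandas as pd

pages = pd.read_parquet("cul-1757-parquet/webpages")  # reads all parquet parts in the directory
print(pages[["crawl_date", "url", "mime_type_tika"]].head())

# Rough domain frequency count, analogous to the Domains derivative.
domains = pages["url"].str.extract(r"https?://([^/]+)")[0].value_counts()
print(domains.head(10))
```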
This dataset is a sample from the TalkingData AdTracking competition. I kept all the positive examples (where is_attributed == 1), while discarding 99% of the negative samples. The sample has roughly 20% positive examples.
For this competition, your objective was to predict whether a user will download an app after clicking a mobile app advertisement.
train_sample.csv - Sampled data
Each row of the training data contains a click record, with the following features.
- ip: ip address of click.
- app: app id for marketing.
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded

Note that ip, app, device, os, and channel are encoded.
I'm also including Parquet files with various features for use within the course.
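A minimal sketch (assuming train_sample.csv has been downloaded to the working directory) for checking the class balance described above:

```python
# Minimal sketch: load the sampled clicks and check the ~20% positive rate.
# Assumes train_sample.csv is in the working directory; columns as listed above.
import pandas as pd

clicks = pd.read_csv("train_sample.csv", parse_dates=["click_time"])
print(clicks["is_attributed"].mean())  # fraction of clicks that led to a download

# Download rate per (encoded) app id, as a quick sanity check of the features.
print(clicks.groupby("app")["is_attributed"].mean().sort_values(ascending=False).head())
```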
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set provides results of tissue from organisms found in surface waters, from the California Environmental Data Exchange Network (CEDEN). The data are of tissue from individual organisms and of composite samples where tissue samples from multiple organisms are combined and then analyzed. Both the individual and the composite sample results may be given, so for individual samples there will be a row for the individual sample and a row for the composite, where the number per composite is one. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result. Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data. Example R code using the API to access data across all years can be found here. Users who want to manually download more specific subsets of the data can also use the CEDEN query tool, at: https://ceden.waterboards.ca.gov/AdvancedQueryTool
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files. Limitations with this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each dat ...
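The release documents an R workflow (example_script_for_using_parquet.R with the arrow package); an analogous hedged sketch in Python with pyarrow, using a hypothetical unzipped directory name, would be:

```python
# Hedged sketch: open the nested parquet directories with pyarrow instead of
# the provided R script. The directory name is a placeholder; inspect the
# schema before selecting columns, since column names are not assumed here.
import pyarrow.dataset as ds

lake_temps = ds.dataset("lake_temperature_parquet", format="parquet")
print(lake_temps.schema)  # list the available columns

# Once the schema is known, pull a subset of columns into pandas, e.g.:
# df = lake_temps.to_table(columns=["<column_a>", "<column_b>"]).to_pandas()
```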
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was generated from skihikingkevin/pubg-match-deaths. It consists only of matches where at least one player has played more than one game (in different matches). The data was processed using polars and converted from CSV to Parquet files. A random sample was performed (groupwise) to produce an 80/20 split.
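A hedged sketch of the kind of group-wise 80/20 split described above, written with polars; the file name and the "match_id" grouping column are assumptions, not the actual schema:

```python
# Hedged sketch: group-wise 80/20 split with polars, keeping all rows of a
# match in the same split. File name and "match_id" column are assumptions.
import polars as pl

deaths = pl.read_parquet("deaths.parquet")
match_ids = deaths.select(pl.col("match_id").unique())

train_ids = match_ids.sample(fraction=0.8, seed=42)
train = deaths.join(train_ids, on="match_id", how="semi")
test = deaths.join(train_ids, on="match_id", how="anti")
print(train.height, test.height)
```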
Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns: domain, count

Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content

Web Graph
.webgraph()
Produces a DataFrame with the following columns: crawl_date, src, dest, anchor

Image Links
.imageLinks()
Produces a DataFrame with the following columns: src, image_url

Binary Analysis
Images, PDFs, Presentation program files, Spreadsheets, Text files, Word processor files

The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:
- Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
- Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
- Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
- Domains count file. A text file containing the frequency count of domains captured within your web archive.

Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example: cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz