https://creativecommons.org/publicdomain/zero/1.0/
This GPS trajectory dataset was collected in the (Microsoft Research) Geolife project by 178 users over a period of more than four years (from April 2007 to October 2011). A GPS trajectory in this dataset is represented by a sequence of time-stamped points, each of which contains latitude, longitude and altitude. The dataset contains 17,621 trajectories with a total distance of 1,251,654 kilometers and a total duration of 48,203 hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.
This dataset recorded a broad range of users’ outdoor movements, including not only life routines like going home and going to work but also entertainment and sports activities, such as shopping, sightseeing, dining, hiking, and cycling.
Data Format

Trajectory file
Every single folder of this dataset stores a user’s GPS log files, which were converted to PLT format. Each PLT file contains a single trajectory and is named by its starting time. To avoid potential confusion of time zone, we use GMT in the date/time property of each point, which is different from our previous release.

PLT format
Lines 1…6 are useless in this dataset, and can be ignored. Points are described in the following lines, one per line.
- Field 1: Latitude in decimal degrees.
- Field 2: Longitude in decimal degrees.
- Field 3: All set to 0 for this dataset.
- Field 4: Altitude in feet (-777 if not valid).
- Field 5: Date - number of days (with fractional part) that have passed since 12/30/1899.
- Field 6: Date as a string.
- Field 7: Time as a string.
Note that field 5 and fields 6 and 7 represent the same date/time in this dataset. You may use either of them.
Example:
39.906631,116.385564,0,492,40097.5864583333,2009-10-11,14:04:30
39.906554,116.385625,0,492,40097.5865162037,2009-10-11,14:04:35

Transportation mode labels
Possible transportation modes are: walk, bike, bus, car, subway, train, airplane, boat, run and motorcycle. Again, we have converted the date/time of all labels to GMT, even though most of them were created in China.
Example:
Start Time | End Time | Transportation Mode |
---|---|---|
2008/04/02 11:24:21 | 2008/04/02 11:50:45 | bus |
2008/04/03 01:07:03 | 2008/04/03 11:31:55 | train |
2008/04/03 11:32:24 | 2008/04/03 11:46:14 | walk |
2008/04/03 11:47:14 | 2008/04/03 11:55:07 | car |
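The point format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names, the dictionary keys and the conversion of altitude to metres are illustrative choices made here, while the field order, the -777 sentinel and the 12/30/1899 epoch come from the format description.

```python
from datetime import datetime, timedelta

# Field 5 counts days (with a fractional part) from 12/30/1899.
PLT_EPOCH = datetime(1899, 12, 30)
FEET_TO_METRES = 0.3048
INVALID_ALTITUDE = -777  # sentinel the dataset uses for "altitude not valid"


def parse_plt_point(line):
    """Parse one point record, i.e. any line after the six header lines."""
    lat, lon, _unused, alt_ft, days, date_str, time_str = line.strip().split(",")
    altitude_ft = float(alt_ft)
    return {
        "lat": float(lat),
        "lon": float(lon),
        # Converting feet to metres is a convenience added in this sketch.
        "altitude_m": None if altitude_ft == INVALID_ALTITUDE
        else altitude_ft * FEET_TO_METRES,
        # Field 5 and fields 6 and 7 encode the same instant (GMT).
        "time_from_days": PLT_EPOCH + timedelta(days=float(days)),
        "time_from_strings": datetime.strptime(
            f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S"
        ),
    }


def parse_plt_lines(lines, header_lines=6):
    """Skip the header lines and parse every remaining non-empty line."""
    return [parse_plt_point(line) for line in lines[header_lines:] if line.strip()]


point = parse_plt_point(
    "39.906631,116.385564,0,492,40097.5864583333,2009-10-11,14:04:30"
)
print(point["time_from_strings"], point["altitude_m"])
```

Because field 5 is stored with limited decimal precision, the timestamp derived from it can differ from the string fields by a few microseconds; rounding to the nearest second, or using fields 6 and 7 directly, avoids surprises when comparing times.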
First, you can regard both the taxi and car labels as driving, although we keep them as different labels for future use. Second, one user may label a light-rail trip as train while others label it as subway. In fact, no trajectory can be recorded in an underground subway system, since a GPS logger cannot receive any signal there. In Beijing, the light rail and subway systems are seamlessly connected; e.g., line 13 (a light rail) connects with line 10 and line 2, which are subways. Sometimes a single line (like line 5) is partly subway and partly light rail. Users may therefore interpret their transportation modes differently. You can differentiate real train trajectories (connecting two cities) from light-rail trajectories (generated within a city) by their distances, or simply treat them the same.
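One way to apply the distance rule suggested above is to sum great-circle distances along a trajectory and compare the total with a cut-off. The sketch below uses the haversine formula with a mean Earth radius. The 100 km threshold and the function names are hypothetical choices made for illustration; the dataset documentation does not specify a value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

# Hypothetical cut-off; the dataset documentation gives no specific value.
INTERCITY_THRESHOLD_KM = 100.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def path_length_km(points):
    """Sum of segment lengths for a list of (lat, lon) pairs."""
    return sum(haversine_km(*a, *b) for a, b in zip(points, points[1:]))


def rail_kind(points, threshold_km=INTERCITY_THRESHOLD_KM):
    """Label a rail trajectory by its length (illustrative rule only)."""
    if path_length_km(points) >= threshold_km:
        return "intercity train"
    return "light rail / subway"


print(rail_kind([(39.906631, 116.385564), (39.906554, 116.385625)]))
```

A threshold on path length is only a heuristic: a long suburban line and a short intercity hop can overlap, so the cut-off is best tuned against a few labelled trajectories.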
More: User Guide: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/User20Guide-1.2.pdf
Please cite the following papers when using this GPS dataset.
[1] Yu Zheng, Lizhu Zhang, Xing Xie, Wei-Ying Ma. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of International conference on World Wide Web (WWW 2009), Madrid Spain. ACM Press: 791-800.
[2] Yu Zheng, Quannan Li, Yukun Chen, Xing Xie, Wei-Ying Ma. Understanding Mobility Based on GPS Data. In Proceedings of ACM conference on Ubiquitous Computing (UbiComp 2008), Seoul, Korea. ACM Press: 312-321.
[3] Yu Zheng, Xing Xie, Wei-Ying Ma. GeoLife: A Collaborative Social Networking Service among User, location and trajectory. Invited paper, in IEEE Data Engineering Bulletin. 33, 2, 2010, pp. 32-40.
This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks, location privacy, and location recommendation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Muis et al Coastal Flood datasets
These datasets present the first global reanalysis of storm surges and extreme sea levels (GTSR data set) based on hydrodynamic modelling. GTSR covers the entire world’s coastline and consists of time series of tides and surges, and estimates of extreme sea levels. Validation shows that there is good agreement between modelled and observed sea levels, and that the performance of GTSR is similar to that of many regional hydrodynamic models.
More information on the methods can be found at: Muis, S., Verlaan, M., Winsemius, H.C., Aerts, J.C.J.H., Ward, P.J., 2016. A global reanalysis of storm surge and extreme sea levels. Nat. Commun. 7, 1–11. doi:10.1038/ncomms11969
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This unique and huge data set contains plant information for the Himalaya Uplands; it consists of 164,360 records. The database is implemented in MS ACCESS following ABCD 1.2. It describes Asian plant species related to the Tibetan Plateau, Central Asia. Data have been collected for over 50 years, and in over 11 countries (e.g. Afghanistan, Pakistan, Bhutan, China, India, Kazakhstan, Kyrgyzstan, Myanmar, Nepal, Russia, Tajikistan, Turkmenistan, Uzbekistan), covering over 220 national regions. Taxonomic information for this region is diverse and not well studied. However, the database follows ICBN taxonomy matched with ITIS and consists of over 5,562 unique species entries. Of these, 996 species are listed in ITIS. Over 2,200 collectors from all over the world contributed to this dataset, which was mostly compiled and maintained by the author for over 20 years. This database covers 21,869 localities. Virtually all sites are georeferenced with latitude and longitude (2 decimals; geographic datum of WGS84), and 6,668 of such unique locations are found in the HUP database. This dataset has altitude information provided by the fieldworker.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data in this dataset were collected through a survey of Latvian society (2021) aimed at identifying high-value data sets for Latvia, i.e. data sets that, in the view of Latvian society, could create value for the Latvian economy and society. The survey was created for both individuals and businesses. It is being made public both to act as supplementary data for the paper "Towards enrichment of the open government data: a stakeholder-centered determination of High-Value Data sets for Latvia" (author: Anastasija Nikiforova, University of Latvia) and to allow other researchers to use these data in their own work.
The survey was distributed among Latvian citizens and organisations. The structure of the survey is available in the supplementary file (see Survey_HighValueDataSets.odt).
Description of the data in this data set: structure of the survey and pre-defined answers (if any)
1. Have you ever used open (government) data? - {(1) yes, once; (2) yes, there has been a little experience; (3) yes, continuously; (4) no, it wasn’t needed for me; (5) no, have tried but has failed}
2. How would you assess the value of open government data that are currently available for your personal use or your business? - 5-point Likert scale, where 1 – any to 5 – very high
3. If you ever used the open (government) data, what was the purpose of using them? - {(1) Have not had to use; (2) to identify the situation for an object or an event (e.g. Covid-19 current state); (3) data-driven decision-making; (4) for the enrichment of my data, i.e. by supplementing them; (5) for better understanding of decisions of the government; (6) awareness of governments’ actions (increasing transparency); (7) forecasting (e.g. trends etc.); (8) for developing data-driven solutions that use only the open data; (9) for developing data-driven solutions, using open data as a supplement to existing data; (10) for training and education purposes; (11) for entertainment; (12) other (open-ended question)}
4. What category(ies) of “high value datasets” is, in your opinion, able to create added value for society or the economy? - {(1) Geospatial data; (2) Earth observation and environment; (3) Meteorological; (4) Statistics; (5) Companies and company ownership; (6) Mobility}
5. To what extent do you think the current data catalogue of Latvia’s Open data portal corresponds to the needs of data users/ consumers? - 10-point Likert scale, where 1 – no data are useful, but 10 – fully correspond, i.e. all potentially valuable datasets are available
6. Which of the current data categories in Latvia’s open data portals, in your opinion, most corresponds to the “high value dataset”? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
7. Which of them form your TOP-3? - {(1) Foreign affairs; (2) business economy; (3) energy; (4) citizens and society; (5) education and sport; (6) culture; (7) regions and municipalities; (8) justice, internal affairs and security; (9) transports; (10) public administration; (11) health; (12) environment; (13) agriculture, food and forestry; (14) science and technologies}
8. How would you assess the value of the following data categories?
8.1. sensor data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.2. real-time data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
8.3. geospatial data - 5-point Likert scale, where 1 – not needed to 5 – highly valuable
9. What would be these datasets? I.e. what (sub)topic could these data be associated with? - open-ended question
10. Which of the data sets currently available could be valuable and useful for society and businesses? - open-ended question
11. Which of the data sets currently NOT available in Latvia’s open data portal could, in your opinion, be valuable and useful for society and businesses? - open-ended question
12. How did you define them? - {(1) Subjective opinion; (2) experience with data; (3) filtering out the most popular datasets, i.e. based on public opinion; (4) other (open-ended question)}
13. How high could the value of these data sets be for you or your business? - 5-point Likert scale, where 1 – not valuable, 5 – highly valuable
14. Do you represent any company/ organization (are you working anywhere)? (if “yes”, please, fill out the survey twice, i.e. as an individual user AND a company representative) - {yes; no; I am an individual data user; other (open-ended)}
15. What industry/ sector does your company/ organization belong to? (if you do not work at the moment, please, choose the last option) - {Information and communication services; Financial and insurance activities; Accommodation and catering services; Education; Real estate operations; Wholesale and retail trade; repair of motor vehicles and motorcycles; transport and storage; construction; water supply; waste water; waste management and recovery; electricity, gas supply, heating and air conditioning; manufacturing industry; mining and quarrying; agriculture, forestry and fisheries; professional, scientific and technical services; operation of administrative and service services; public administration and defence; compulsory social insurance; health and social care; art, entertainment and recreation; activities of households as employers; CSO/NGO; I am not a representative of any company}
16. To which category does your company/ organization belong in terms of its size? - {small; medium; large; self-employed; I am not a representative of any company}
17. What is the age group that you belong to? (if you are an individual user, not a company representative) - {11..15, 16..20, 21..25, 26..30, 31..35, 36..40, 41..45, 46+, “do not want to reveal”}
18. Please indicate the education or scientific degree that corresponds most to you (if you are an individual user, not a company representative) - {master degree; bachelor’s degree; Dr. and/ or PhD; student (bachelor level); student (master level); doctoral candidate; pupil; do not want to reveal these data}
Format of the file: .xls, .csv (for the first spreadsheet only), .odt
Licenses or restrictions: CC-BY
This repository contains the code and data for the paper "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos". Sa2VA is the first unified model for the dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation… See the full description on the dataset page: https://huggingface.co/datasets/Dense-World/Sa2VA-Training.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "life-on-earth"
Dataset Summary
The David Attenborough Research Consortium (DARC) loves David Attenborough (DA). We therefore aim to enrich his fantastic work using modern deep learning, generative artificial intelligence (AI) methods and the most recent assistants like ChatGPT. Those results, together with extracted and time-stamped image frames ("frame_00000_hh-mm-ss.msmsms.jpg", ...) from videos, constitute the darcai-life-on-earth dataset. As a first… See the full description on the dataset page: https://huggingface.co/datasets/mikehemberger/darcai-life-on-earth.
The Tajik Living Standards Survey (TLSS) was conducted jointly by the State Statistical Agency and the Center for Strategic Studies under the Office of the President in collaboration with the sponsors, the United Nations Development Programme (UNDP) and the World Bank (WB). International technical assistance was provided by a team from the London School of Economics (LSE). The purpose of the survey is to provide quantitative data at the individual, household and community level that will facilitate purposeful policy design on issues of welfare and living standards of the population of the Republic of Tajikistan in 1999.
National coverage.
The country is divided into 4 oblasts, or regions: Leninabad in the northwest of the country, Khatlon in the southwest, Rayons of Republican Subordination (RRS) in the middle and to the west of the country, and Gorno-Badakhshan Autonomous Oblast (GBAO) in the east. The capital, Dushanbe, in the RRS oblast, is a separately administered area. Oblasts are divided into rayons (districts). Rayons are further subdivided into Mahallas (committees) in urban areas, and Jamoats (villages) in rural areas.
Sample survey data [ssd]
The TLSS sample was designed to represent the population of the country as a whole as well as the strata. The sample was stratified by oblast and by urban and rural areas.
In common with standard LSMS practice, a two-stage sample was used. In the first stage, 125 primary sample units (PSU) were selected, with the probability of selection within strata being proportional to size. At the second stage, 16 households were selected within each PSU, with each household in the area having the same probability of being chosen. [Note: In addition to the main sample, the TLSS also included a secondary sample of 15 extra PSU (containing 400 households) in Dangara and Varzob. Data in the oversampled areas were collected for the sole purpose of providing baseline data for the World Bank Health Project in these areas. The sampling for these additional units was carried out separately after the main sampling procedure in order to allow for their exclusion in nationally representative analysis.] The two-stage procedure has the advantage that it provides a self-weighted sample. It also simplified the fieldwork operation, as one field team could be assigned to cover a number of PSU.
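The self-weighting property can be seen from a short calculation. This is a sketch under simplifying assumptions that the text does not state explicitly: within a stratum, $a$ PSU are drawn with probability proportional to size, the size measure $M_i$ of PSU $i$ is its number of households, $M = \sum_i M_i$ is the stratum total, and $m = 16$ households are then drawn with equal probability inside each selected PSU.

```latex
\pi_{i} = \frac{a\,M_{i}}{M}, \qquad
\pi_{j \mid i} = \frac{m}{M_{i}}, \qquad
\pi_{ij} = \pi_{i}\,\pi_{j \mid i}
         = \frac{a\,M_{i}}{M}\cdot\frac{m}{M_{i}}
         = \frac{a\,m}{M}
```

The PSU size $M_i$ cancels, so every household in the stratum has the same overall inclusion probability and needs no individual weight. If the size measure is population rather than households, or if PSU are not allocated to strata in proportion to stratum size, the cancellation holds only approximately. With 125 PSU and 16 households each, the main sample comes to $125 \times 16 = 2000$ households.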
A critical problem in the sample selection in Tajikistan was the absence of an up-to-date national sample frame from which to select the PSU. As a result, lists of the towns, rayons and jamoats (villages) within rayons were prepared manually. Current data on population size according to village and town registers was then supplied to the regional offices of Goskomstat and conveyed to the center. This allowed the construction of a sample frame of enumeration units by sample size from which to draw the PSU.
This procedure worked well in establishing a sample frame for the rural population. However, administrative units in some of the larger towns and in the cities of Dushanbe, Khojand and Kurgan-Tubbe were too large and had to be sub-divided into smaller enumeration units. Fortuitously, the survey team was able to use information from the mapping exercise carried out earlier in the year in preparation for the 2000 Census to subdivide these larger areas into enumeration units of roughly similar size.
The survey team was also able to use the household listings prepared for the Census for the second stage of the sampling in urban areas. In rural areas the selection of households was made using the village registers, a complete listing of all households in the village which is (purported to be) regularly updated by the local administration. When selecting the target households, a few extra households (4 in addition to the 16) were also randomly selected, to be used if replacements were needed. In practice, non-response and refusals from households were very rare and use of replacement households was low. The refusal rate was never so high that the reserve list ran out of households, and this enabled a full sample of 2000 randomly selected households to be interviewed.
Face-to-face [f2f]
The questionnaire was based on the standard LSMS for the CIS countries, and adapted and abridged for Tajikistan. In particular, the health section was extended to allow more in-depth information to be collected, and a section on food security was added. The employment section was reduced and excludes information on searching for employment.
The questionnaires were translated into Tajik, Russian and Uzbek.
The TLSS consists of three parts: a household questionnaire, a community level questionnaire and a price questionnaire.
Household questionnaire: the Household questionnaire is comprised of 10 sections covering both household and individual aspects.
Community/Population point Questionnaire: the Community level or Population Point Questionnaire consists of 8 sections. The community level questionnaire provides information on differences in demographic and economic infrastructure. Open-ended questions in the questionnaire were not coded and hence information on the responses to these qualitative questions is not provided in the data sets.
Summary of Section contents
The brief descriptions below provide a summary of the information found in each section. The descriptions are by no means exhaustive of the information covered by the survey and users of the survey need to refer to each particular section of the questionnaire for a complete picture of the information gathered.
Household information/roster: This includes individual level information on all individuals in the household. It establishes who belongs to the household at the time of the interview. Information on gender, age, relation to household head and marital status is included. In the question relating to family status, question 7, “Nekared” means married, where nekar is the Islamic (Arabic) term for marriage contract. Under Islamic law a man may marry more than once (up to four wives at any one time). Although during the Soviet period it was illegal to be married to more than one woman, this practice did go on. There may be households where the household head is not present but the wife is married or nekared, or in the same household one respondent may answer married and another nekared to the household head.
Dwelling: This section includes information covering the type of dwelling, availability of utilities and water supply as well as questions pertaining to dwelling expenses, rents, and the payment of utilities and other household expenses. Information is at the household level.
Education: This section includes all individuals aged 7 years and older and looks at the educational attainment of individuals and the reasons for not continuing education for those who are not currently studying. Questions related to educational expenditures at the household level are also covered. Schooling in Tajikistan is compulsory for grades (classes) 1-9. Primary level education refers to grades 1-4 for children aged 7 to 11 years old. General secondary level education refers to grades 5-9, corresponding to the age group 12-16 year olds. Post-compulsory schooling can be divided into three types of school:
- Upper secondary education covers grades 10 and 11.
- Vocational and technical schools can start after grade 9 and last around 4 years. These schools can also start after grade 11 and then last only two years. Technical institutions provide medical and technical (e.g. engineering) education as well as education in the field of the arts, while vocational schools provide training for employment in specialized occupations.
- Tertiary or university education can be entered after completing all 11 grades.
Kindergarten schools offer pre-compulsory education for children aged 3-6 years old; information on this type of schooling is not covered in this section.
Health: This section examines individual health status and the nature of any illness over the recent months. Additional questions relate to more detailed information on the use of health care services and hospitals, including expenses incurred due to ill health. Section 4B includes a few terms, abbreviations and acronyms that need further clarification. A feldscher is an assistant to a physician. Mediniski dom or FAPs are clinics staffed by physician assistants and/or midwives, and a SUB is a local clinic. CRH is a local hospital, while an oblast hospital is a regional hospital based in the oblast administrative centre, and the Repub. Hospital is a national hospital based in the capital, Dushanbe. The latter two are both public hospitals.
Employment: This section covers individuals aged 11 years and over. The first part of this section looks at the different activities in which individuals are involved in order to determine if a person is engaged in an income generating activity. Those who are engaged in such activities are required to answer questions in Part B. This part relates to the nature of the work and the organization the individual is attached to as well as questions relating to income, cash income and in-kind payments. There are also a few questions relating to income generating activities in addition to the main activity. Part C examines employment
No description is available. Visit https://dataone.org/datasets/e6506f5b521aecb9467b7eef5009e104 for complete metadata about this dataset.
This repository hosts datasets used in the project: DATA ANALYSIS OF DATETIME BASED OCR. These datasets are derived from surveillance videos embedded with overlay text showing the date and time of recording in YYYY-MM-DD and HH:MM:SS formats, respectively. The datasets are intended for use in OCR (Optical Character Recognition) training and evaluation, particularly in timestamp recognition tasks.
Dataset | Date Captured | Time Span | Dimensions (px) | File Size Range | Duration |
---|---|---|---|---|---|
1 | 25 October 2024 | 14:34:20 – 21:02:35 | 457 × 55 | 3–11 KB | ~7 hours |
2 | 19 October 2023 | 11:54:09 – 21:12:47 | 224 × 25 | 1–4 KB | ~9 hours |
3 | 10 January 2024 | 00:05:45 – 23:58:45 | 420 × 50 | 2–8 KB | ~24 hours |
Each image is a cropped region containing the timestamp overlay extracted from a video frame. The datasets include various degrees of corruption, camera motion, and resolution to reflect real-world surveillance conditions.
Video files: .ts
Crop coordinates: [left=1438, top=15, right=1895, bottom=70]
Each frame in the video is read at specified intervals, cropped using predefined coordinates, and saved in .jpg format to the corresponding dataset folder.
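The cropping step can be illustrated without any imaging library. The sketch below treats a frame as a list of pixel rows and slices out the box quoted in the text. The 1920 × 1080 frame size, the sampling step and the function names are assumptions made here for illustration; a real pipeline would read frames from the video and write .jpg files with an image library.

```python
# Crop box quoted in the text, in pixel coordinates of the full video frame.
CROP_BOX = {"left": 1438, "top": 15, "right": 1895, "bottom": 70}


def crop_size(box):
    """Width and height in pixels of the cropped region."""
    return box["right"] - box["left"], box["bottom"] - box["top"]


def crop(frame, box):
    """Crop a frame given as a list of rows, each row a list of pixels."""
    return [row[box["left"]:box["right"]]
            for row in frame[box["top"]:box["bottom"]]]


def sampled_frame_indices(total_frames, step):
    """Indices of the frames to read when sampling every `step` frames."""
    return list(range(0, total_frames, step))


# Assumed frame size (1920 x 1080); the source does not state the resolution.
dummy_frame = [[0] * 1920 for _ in range(1080)]
patch = crop(dummy_frame, CROP_BOX)
print(crop_size(CROP_BOX), len(patch[0]), len(patch))
```

The box yields a 457 × 55 pixel region, which agrees with the dimensions listed for Dataset 1 in the table above.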
OCR-based timestamp labelling was performed semi-automatically using PaddleOCR with the following setup:
- lang='en'
- Date format normalisation (e.g. 20240110 → 2024-01-10).
Metadata includes:
Ground truth results are available in:
It is observed that the training losses for both models are considerably higher than validation loss, which is less common behaviour. Suspecting that data quirks may be in play, the datasets are reevaluated.
Due to the qualities of time, datetime data may present bias as time components of a higher degree persist throughout the dataset for a longer period than those of lower degree. To confirm that such bias is not amplified further by subsequent frames of same timestamps, all the datasets are filtered to remove images with duplicate timestamp values by keeping the first occurrence only. Then, they are reallocated into training and testing with the same train-test ratio, without similar data being present in both. This is done so that the model trains on diverse text samples rather than repeated words to better evaluate the models’ exploration and generalisation capabilities.
Dataset | Train Size | Test Size |
---|---|---|
1 | 50,854 | 12,713 |
2 | 112,816 | 28,204 |
3 | 5,780 | 1,446 |
Filtered Dataset | Train Size | Test Size |
---|---|---|
1 | 25,505 | 6,377 |
2 | 56,718 | 14,180 |
3 | 3,213 | 804 |
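The filtering and re-splitting described above can be sketched as follows. This is an illustration, not the project's actual code: the function names and sample records are made up, the 0.8 ratio is inferred from the roughly 80/20 sizes in the tables, and the split is taken in order because the source does not say whether samples were shuffled.

```python
def keep_first_per_timestamp(samples):
    """Drop repeated timestamps, keeping only the first occurrence.

    `samples` is a sequence of (image_name, timestamp) pairs in frame order.
    """
    seen = set()
    unique = []
    for name, timestamp in samples:
        if timestamp not in seen:
            seen.add(timestamp)
            unique.append((name, timestamp))
    return unique


def split_train_test(samples, train_ratio=0.8):
    """Ordered split; train_ratio=0.8 is an inference from the tables."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


# Made-up records: consecutive frames often share the same overlay timestamp.
frames = [
    ("f0.jpg", "2024-01-10 00:05:45"),
    ("f1.jpg", "2024-01-10 00:05:45"),
    ("f2.jpg", "2024-01-10 00:05:46"),
    ("f3.jpg", "2024-01-10 00:05:47"),
    ("f4.jpg", "2024-01-10 00:05:47"),
    ("f5.jpg", "2024-01-10 00:05:48"),
    ("f6.jpg", "2024-01-10 00:05:49"),
]
unique_frames = keep_first_per_timestamp(frames)
train, test = split_train_test(unique_frames)
print(len(unique_frames), len(train), len(test))
```

Because duplicates are removed before the split, no timestamp value can appear in both the training and the test portion.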
The "datasets 4.5.1 & 4.5.2" and "datasets 4.5.3 & 4.5.4" refer to the same datasets used in experiments detailed in sections 4.5.1–2 and 4.5.3–4 of the project respectively. The latter group of datasets have undergone a filtering process to remove duplicate timestamp instances.
For more information, please refer to the main repository: 👉 IvannaLin/DATA-ANALYSIS-OF-DATETIME-BASED-OCR
If you use this dataset in your research, please cite the associated source or contact the corresponding author.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Bank Account Fraud (BAF) suite of datasets was published at NeurIPS 2022 and comprises a total of 6 different synthetic bank account fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind!
This suite of datasets is:
- Realistic, based on a present-day real-world dataset for fraud detection;
- Biased, each dataset has distinct controlled types of bias;
- Imbalanced, this setting presents an extremely low prevalence of positive class;
- Dynamic, with temporal data and observed distribution shifts;
- Privacy preserving, to protect the identity of potential applicants we have applied differential privacy techniques (noise addition), feature encoding and trained a generative model (CTGAN).
Each dataset is composed of:
- 1 million instances;
- 30 realistic features used in the fraud detection use-case;
- A column of “month”, providing temporal information about the dataset;
- Protected attributes (age group, employment status and % income).
Detailed information (datasheet) on the suite: https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf
Check out the github repository for more resources and some example notebooks: https://github.com/feedzai/bank-account-fraud
Read the NeurIPS 2022 paper here: https://arxiv.org/abs/2211.13358
Learn more about Feedzai Research here: https://research.feedzai.com/
Please use the following citation for the BAF dataset suite:
@article{jesusTurningTablesBiased2022,
title={Turning the {{Tables}}: {{Biased}}, {{Imbalanced}}, {{Dynamic Tabular Datasets}} for {{ML Evaluation}}},
author={Jesus, S{\'e}rgio and Pombal, Jos{\'e} and Alves, Duarte and Cruz, Andr{\'e} and Saleiro, Pedro and Ribeiro, Rita P. and Gama, Jo{\~a}o and Bizarro, Pedro},
journal={Advances in Neural Information Processing Systems},
year={2022}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for DWD ICON Global Forecast
This dataset comprises forecasts from the German Weather Service's (DWD) ICON-Global model from March 2023 to the present, with all variables included. Each forecast runs up to 4 days into the future, and the model is run 4 times per day. This data is an archive of the publicly available data at https://opendata.dwd.de/weather/nwp/, converted to Zarr format with Xarray. No other processing of the data is performed.
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/openclimatefix/dwd-icon-global.