Loading Datasets into ERDDAP

GenerateDatasetsXml is a command line program that can generate a rough draft of the dataset XML for almost any type of dataset.

We STRONGLY RECOMMEND that you use GenerateDatasetsXml instead of creating chunks of datasets.xml by hand because:

  • GenerateDatasetsXml works in seconds. Doing this by hand is at least an hour's work, even when you know what you're doing.
  • GenerateDatasetsXml does a better job. Doing this by hand requires extensive knowledge of how ERDDAP works. It is unlikely that you will do a better job by hand. (Bob Simons always uses GenerateDatasetsXml for the first draft, and he wrote ERDDAP.)
  • GenerateDatasetsXml always generates a valid chunk of datasets.xml. Any chunk of datasets.xml that you write will probably have at least a few errors that prevent ERDDAP from loading the dataset. It often takes people hours to diagnose these problems. Don't waste your time. Let GenerateDatasetsXml do the hard work. Then you can refine the .xml by hand if you want.

Generating the dataset XML from the command line is strongly recommended by Bob Simons, the author of the ERDDAP source code.

GenerateDatasetsXml first asks you to specify the EDDType (Erd Dap Dataset Type) of the dataset. See the List of Dataset Types (in this document) to figure out which type is appropriate for the dataset you are working on.

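So before running the command line tool, you need to be familiar with the supported dataset types and their parameters. I tested with .nc data; Bob's example uses ASCII data.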

Let's say you run the script: ./GenerateDatasetsXml.sh 
Then enter: EDDTableFromNcFiles 
Then enter: /u00/data/ 
Then enter: .*\.nc
Then enter: /u00/data/sampleFile.nc
Then enter: 10
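
The file name regex entered above (.*\.nc) can be tried against a few file names before running the script. The following is a minimal Python sketch, not part of ERDDAP. It assumes the regex has to match the entire file name, and every file name except sampleFile.nc is hypothetical.

```python
import re

# The regex from the example session above; the backslash escapes the dot.
FILE_NAME_REGEX = r".*\.nc"


def matching_files(names, regex=FILE_NAME_REGEX):
    """Return the names that the regex matches in full (assumed semantics)."""
    pattern = re.compile(regex)
    return [name for name in names if pattern.fullmatch(name)]


# Hypothetical file names, apart from sampleFile.nc which appears above.
candidates = ["sampleFile.nc", "notes.txt", "old.nc.bak", "readmenc"]
print(matching_files(candidates))  # ['sampleFile.nc']
```

Without the backslash, the dot would match any character, so a name such as "readmenc" would also be accepted.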

DISCLAIMER: The chunk of datasets.xml made by GenerateDatasetsXml isn't perfect. YOU MUST READ AND EDIT THE XML BEFORE USING IT IN A PUBLIC ERDDAP. 

The output of the command line tool is not necessarily correct, so it has to be edited.

Diagnostic information and the rough draft of the dataset XML will be written to bigParentDirectory/logs/GenerateDatasetsXml.log.

The rough draft of the dataset XML will be written to bigParentDirectory/logs/GenerateDatasetsXml.out.

The generated XML is first saved in the GenerateDatasetsXml.out file.

If you run into problems, refer to the following.

Other Ways To Diagnose Problems With Datasets 
In addition to the two main tools, the following can help:

  • log.txt is a log file with all of ERDDAP's diagnostic messages.
  • The Daily Report has more information than the status page, including a list of datasets that didn't load and the exceptions (errors) they generated.
  • The Status Page is a quick way to check ERDDAP's status from any web browser. It includes a list of datasets that didn't load (although not the related exceptions) and taskThread statistics (showing the progress of EDDGridCopy and EDDTableCopy datasets).
  • If you get stuck, please send an email with the details to bob dot simons at noaa dot gov. 
    Or, you can join the ERDDAP Google Group / Mailing List and post your question there. 

log.txt contains the error messages, the Daily Report is sent by email, and the status page is also a useful reference.

Limits to the Size of a Dataset 
You'll see many references to "2 billion" below. More accurately, that is a reference to 2,147,483,647 (2^31-1), which is the maximum value of a 32-bit signed integer. In some computer languages, for example Java (which ERDDAP is written in), that is the largest data type that can be used for many data structures (for example, the size of an array).
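
The arithmetic behind "2 billion" can be checked directly. This is a small Python sketch; reading the "~9e18" figure that appears elsewhere in this section as the maximum of a 64-bit signed integer (Java's long) is my assumption, not something the text states.

```python
# Largest value of a 32-bit signed integer (Java's int).
INT32_MAX = 2**31 - 1

# Largest value of a 64-bit signed integer (Java's long).
# Assumption: this is what the "~9e18" figure refers to.
INT64_MAX = 2**63 - 1

print(INT32_MAX)           # 2147483647
print(f"{INT64_MAX:.2e}")  # 9.22e+18
```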

For String values (for example, for variable names, attribute names, String attribute values, and String data values), the maximum number of characters per String in ERDDAP is ~2 billion. But in almost all cases, there will be small or large problems if a String exceeds a reasonable size (e.g., 80 characters for variable names and attribute names, and 255 characters for most String attribute values and data values). For example, web pages which display long variable names will be awkwardly wide and long variable names will be truncated if they exceed the limit of the response file type.

For gridded datasets:

  • The maximum number of axisVariables is ~2 billion. 
    The maximum number of dataVariables is ~2 billion. 
    But if a dataset has >100 variables, it will be cumbersome for users to use. 
    And if a dataset has >1 million variables, your server will need a lot of physical memory and there will be other problems.
  • The maximum size of each dimension (axisVariable) is ~2 billion values.
  • I think the maximum total number of cells (the product of all dimension sizes) is unlimited, but it may be ~9e18.

For tabular datasets:

  • The maximum number of dataVariables is ~2 billion. 
    But if a dataset has >100 variables, it will be cumbersome for users to use. 
    And if a dataset has >1 million variables, your server will need a lot of physical memory and there will be other problems.
  • The maximum number of sources (for example, files) that can be aggregated is ~2 billion.
  • In some cases, the maximum number of rows from an individual source (for example, a file, but not a database) is ~2 billion rows.
  • I don't think there are other limits.

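The number of items in these data structures is limited to about 2 billion.

Of course, the size of a single dataset is also limited.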

For both gridded and tabular datasets, there are some internal limits on the size of the subset that can be requested by a user in a single request (often related to >2 billion of something or ~9e18 of something), but it is far more likely that a user will hit the file-type-specific limits.

  • NetCDF version 3 .nc files are limited to 2 GB. (If this is really a problem for someone, let me know: I could add support for the NetCDF version 3 .nc 64-bit extension or NetCDF Version 4, which would increase the limit significantly, but not infinitely.)
  • Browsers crash after only ~500MB of data, so ERDDAP limits the response to .htmlTable requests to ~400MB of data.
  • Many data analysis programs have similar limits (for example, the maximum size of a dimension is often ~2 billion values), so there is no reason to work hard to get around the file-type-specific limits.
  • The file-type-specific limits are useful in that they prevent naive requests for truly huge amounts of data (for example, "give me all of this dataset" when the dataset has 20TB of data), which would take weeks or months to download. The longer the download, the more likely it will fail for a variety of reasons.
  • The file-type-specific limits are useful in that they force the user to deal with reasonably-sized subsets (for example, dealing with a large gridded dataset via files with data from one time point each). 
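
To get a feel for the file-type limits above, the raw size of a gridded subset can be estimated as the product of its dimension sizes times the bytes per value. The Python sketch below is only a rough guide. It assumes "2 GB" means 2 * 1024**3 bytes, ignores file headers, and uses a hypothetical grid shape.

```python
from math import prod

# Assumption: "2 GB" is taken as 2 * 1024**3 bytes. The exact NetCDF-3
# limit may differ slightly, so treat the result as an estimate.
NC3_LIMIT_BYTES = 2 * 1024**3


def subset_bytes(dim_sizes, bytes_per_value=4):
    """Approximate raw size of one gridded variable: cells * bytes per cell."""
    return prod(dim_sizes) * bytes_per_value


def fits_in_nc3(dim_sizes, bytes_per_value=4):
    """True if the estimated size is below the assumed 2 GB limit."""
    return subset_bytes(dim_sizes, bytes_per_value) < NC3_LIMIT_BYTES


# Hypothetical grid: (time, depth, latitude, longitude) with 4-byte floats.
one_day = (1, 1, 4320, 8640)
one_year = (365, 1, 4320, 8640)

print(subset_bytes(one_day))  # 149299200
print(fits_in_nc3(one_day))   # True
print(fits_in_nc3(one_year))  # False
```

In this made-up case a single time point is about 150 MB, while a full year is far beyond the limit, which matches the advice to request one time point per file.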

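.nc files are limited to 2 GB.

In addition, the datasetID is a required, unique identifier for every dataset. The automatically generated IDs all come out the same, so they need to be changed.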

datasetID="aDatasetID" is a REQUIRED attribute within a <dataset> tag which assigns a short (usually <15 characters), unique, identifying name to a dataset.

  • Valid characters are A-Z, a-z, 0-9, _, and -, but we strongly recommend starting with a letter and then just using A-Z, a-z, 0-9, and _.
  • DatasetIDs are case sensitive, but DON'T create two datasetIDs that only differ in upper/lower case letters. It will cause problems on Windows computers (yours and/or a user's computer).
  • Best practices: We recommend using camelCase.
  • Best practices: We recommend that the first part be an acronym or abbreviation of the source institution's name and the second part be an acronym or abbreviation of the dataset's name. When possible, we create a name which reflects the source's name for the dataset. For example, we used datasetID="erdPHssta8day" for a dataset from the NOAA NMFS SWFSC Environmental Research Division (ERD) which is designated by the source to be satellite/PH/ssta/8day.
  • If you want to change a dataset's name, you need to do something to kill off (err, retire) the dataset with the existing name. The two solutions are:
    • Shutdown Tomcat/ERDDAP. Change the name. Restart ERDDAP.
    • Change the name of the dataset and make a dummy, active="false" dataset to kill off (err, retire) the old dataset: 
      <dataset type="EDDTableFromNcFiles" datasetID="theOldName" active="false" /> 
      You can remove that tag after the old dataset is inactive. 
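
The datasetID rules above are easy to check before editing datasets.xml. This is a minimal Python sketch of such a check, not an ERDDAP tool. The function name check_dataset_ids and all sample IDs except erdPHssta8day are hypothetical.

```python
import re

# Characters the text above lists as valid: A-Z, a-z, 0-9, _ and -
VALID = re.compile(r"[A-Za-z0-9_-]+")
# Stricter recommended form: start with a letter, then letters, digits, _
RECOMMENDED = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_dataset_ids(ids):
    """Return a list of human-readable problems found in the given IDs."""
    problems = []
    seen = {}  # lower-cased ID -> first spelling seen
    for dataset_id in ids:
        if not VALID.fullmatch(dataset_id):
            problems.append(f"{dataset_id}: invalid characters")
        elif not RECOMMENDED.fullmatch(dataset_id):
            problems.append(f"{dataset_id}: valid but not the recommended form")
        # IDs that are identical, or differ only in case, clash.
        key = dataset_id.lower()
        if key in seen:
            problems.append(f"{dataset_id}: clashes with {seen[key]}")
        else:
            seen[key] = dataset_id
    return problems


for problem in check_dataset_ids(
    ["erdPHssta8day", "ERDphSSTA8day", "8day-sst", "my id"]
):
    print(problem)
```

For the sample list this reports the case-only clash, the ID that starts with a digit, and the ID that contains a space.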

 

The main files used:

/usr/local/apache-tomcat/content/erddap/setup.xml

/usr/local/apache-tomcat/content/erddap/datasets.xml

/usr/local/erddap/logs/GenerateDatasetsXml.out

/usr/local/apache-tomcat/webapps/erddap/WEB-INF/GenerateDatasetsXml.sh

For more details, consult the documentation. I'll write more next time.
