Discussion: createrepo and repodata format future
seth vidal
2010-08-04 14:59:58 UTC
Hi folks,
instead of trundling down the existing thread I thought we could
[re]start a discussion of things that could be improved in future
versions of the repodata and how the data is laid out.


there have been a number of times when I wished we could have arranged
some of the metadata differently, and there is a real need to optimize
what has to be downloaded. I've been thinking about this and discussing
it casually with various folks for quite a while but haven't acted on
anything.

there was a discussion in June on yum-devel:
http://lists.baseurl.org/pipermail/yum-devel/2010-June/007123.html

Let's be clear, though: given how repodata/repomd.xml is laid out, there
is no requirement to break backward compat; we can add datatypes and
move forward that way. So there's no reason for drama or gnashing of
teeth. The only trick will be whether all repos support all versions of
the metadata, either by their own choice or by requirement.


The general idea has been to:
- make the metadata more 'chunkable' so you can retrieve smaller pieces
of it and only retrieve more complete sets when you need them or when
there is no disadvantage to getting everything in one blob.

- make it more trivial to search

- keep it as verifiable as it is now

- make it possible to know when a repo has 'expired', if ever

- not break everyone in the process

- make it easier to provide translatable chunks for the
summary/description fields of pkgs

- make it more obvious where/how external metadata (external to the rpms
themselves) can be added to a repository.


The reasons for these changes are fairly simple: with repos growing in
size it is becoming less and less manageable to download huge xml or
sqlite files to update your local cache.

-sv
Duncan Mac-Vicar P.
2010-08-06 10:20:11 UTC
Post by seth vidal
Hi folks,
instead of trundling down the existing thread I thought we could
[re]start a discussion of things that could be improved in future
versions of the repodata and how the data is laid out.
I was thinking about the sqlite part, and I think the change has to go
in a way that
* assumes sqlite3 is _yum_'s cache
* allows distros to use different cache strategies

The sqlite types could go as extra data. Something like:

<data type="filelists">
<checksum
type="sha256">1ef045e8c2a00901b1c3591990b2ec2498d2ed1807b315622a89aa73e6811778</checksum>
<timestamp>1277771835</timestamp>
<size>1098607</size>
<open-size>15276280</open-size>
<open-checksum
type="sha256">7a620afa9594e0f9be4048d1968881079174b6a4600617f55da8bd0f0ae4a4a6</open-checksum>
<location href="repodata/filelists.xml.gz"/>
<cache-location format="sqlite3" href="repodata/filelists.xml.sqldb"/>
</data>

Another problem that shows up is that caches may be different. For
example, yum has one file per metadata type. We have just one solv file
per repository. But I guess that is just another tag under the repomd
tag, specifying the location and the format so that the downloader can
choose whether to use it or not.
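As a minimal sketch of what that selection could look like on the client
side (the <cache-location> element is the proposal above, not an
existing tag, and the helper name is made up):

import xml.etree.ElementTree as ET

NS = '{http://linux.duke.edu/metadata/repo}'

def pick_href(repomd_path, dtype, cache_format='sqlite3'):
    # Prefer a cache hint in a format this client understands;
    # otherwise fall back to the plain <location>.
    root = ET.parse(repomd_path).getroot()
    for data in root.findall(NS + 'data'):
        if data.get('type') != dtype:
            continue
        cache = data.find(NS + 'cache-location')
        if cache is not None and cache.get('format') == cache_format:
            return cache.get('href')
        return data.find(NS + 'location').get('href')
    return None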

Also, what is the status of these tags?

<revision>1277771829</revision>
<tags>
<repo>obsrepository://build.opensuse.org/openSUSE:11.2:Contrib/standard</repo>
</tags>

In repomd.xml, we are already using them.

Duncan
seth vidal
2010-08-06 12:34:15 UTC
Post by Duncan Mac-Vicar P.
I was thinking about the sqlite part, and I think the change has to go
in a way that
* assumes sqlite3 is _yum_'s cache
why? Why not just a random-access md?
Post by Duncan Mac-Vicar P.
* allows distros to use different cache strategies
That's true even now.
Post by Duncan Mac-Vicar P.
<data type="filelists">
<checksum
type="sha256">1ef045e8c2a00901b1c3591990b2ec2498d2ed1807b315622a89aa73e6811778</checksum>
<timestamp>1277771835</timestamp>
<size>1098607</size>
<open-size>15276280</open-size>
<open-checksum
type="sha256">7a620afa9594e0f9be4048d1968881079174b6a4600617f55da8bd0f0ae4a4a6</open-checksum>
<location href="repodata/filelists.xml.gz"/>
<cache-location format="sqlite3" href="repodata/filelists.xml.sqldb"/>
</data>
I don't have a problem with that necessarily except that it means that
every non-xml format is relegated to a 'cache', which I just don't think
is true.

When you think about it - the xml format is a cache, too. It's a
compilation, truncation and compression of the data in the rpm hdrs.

Something that we've kicked about is mix-and-match metadata file formats
based on what makes accessing them most functional. I know opensuse
doesn't have to muck with filedeps in pkgs, but do you resolve them when
a user knows the filename they need but not the pkg name?

the equivalent of: yum install /usr/bin/myprogram

If so - the discussion of breaking the filelists into chunks based on
subdirs might be beneficial to you.
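As a rough sketch of what that could look like in repomd.xml (the
type/dir names here are invented for illustration, not an existing
layout):

<data type="filelists-dir" dir="/usr/lib">
  <location href="repodata/filelists-usr-lib.xml.gz"/>
  <!-- checksum/size entries as for any other data element -->
</data>

A client resolving a dep on /usr/lib/something would fetch just that
chunk instead of the whole filelists.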
Post by Duncan Mac-Vicar P.
Another problem that shows up is that caches may be different. For
example, yum has one file per metadata type. We have just one solv file
per repository. But I guess that is just another tag under the repomd
tag, specifying the location and the format so that the downloader can
choose whether to use it or not.
Also, what is the status of these tags?
<revision>1277771829</revision>
<tags>
<repo>obsrepository://build.opensuse.org/openSUSE:11.2:Contrib/standard</repo>
</tags>
They shouldn't need to change at all.

In fact my original thoughts were that repomd.xml doesn't need to
change, except to add an attribute of 'compression type' to each
datatype.


-sv
Michael Schroeder
2010-08-06 12:51:33 UTC
Post by seth vidal
Something that we've kicked about is mix-and-match metadata file formats
based on what makes accessing them most functional. I know opensuse
doesn't have to muck with filedeps in pkgs, but do you resolve them when
a user knows the filename they need but not the pkg name?
the equivalent of: yum install /usr/bin/myprogram
This would work because /usr/bin/ files are in primary.xml.
libzypp doesn't support on-demand download of the filelist;
libsatsolver offers an interface for it.
Post by seth vidal
If so - the discussion of breaking the filelists into chunks based on
subdirs might be beneficial to you.
Well, that depends on the usage pattern. As zypp doesn't support
downloading the filelist, we make sure that there are no (Build)Requires
on files outside of the "primary" filepatterns. As /bin and /usr/bin
are used in many packages, we need to download the file information
for those directories anyway, so moving them from primary into separate
files makes things worse for us.
Post by seth vidal
Post by Duncan Mac-Vicar P.
Another problem that shows up is that caches may be different. For
example, yum has one file per metadata type. We have just one solv file
per repository. But I guess that is just another tag under the repomd
tag, specifying the location and the format so that the downloader can
choose whether to use it or not.
Also, what is the status of these tags?
<revision>1277771829</revision>
<tags>
<repo>obsrepository://build.opensuse.org/openSUSE:11.2:Contrib/standard</repo>
</tags>
They shouldn't need to change at all.
In fact my original thoughts were that repomd.xml doesn't need to
change, except to add an attribute of 'compression type' to each
datatype.
Oh, that's also what I was thinking of. I didn't like the "primary_lzma"
type because it's not different data, but just a different representation.
I'd prefer:

<data type="primary">...</data>
<data type="primary" flavor="lzma">...</data>
<data type="primary" flavor="sqlite">...</data>

But I don't know how many repomd parsers would choke on this...

Speaking of that lzma patch, I pretty much opposed it because it
conflicts with the "delta download" mechanism I implemented some weeks
ago. The idea is to use 'gzip --rsyncable' for gz compression, add 'zsync'
checksum data to the metalink files and let libzypp download just the
changed blocks with range requests. Works quite nicely for our maintenance
updates; it's probably not very useful for Factory (i.e. "rawhide"), where
the number of rebuilds is quite high.

Cheers,
Michael.
--
Michael Schroeder ***@suse.de
SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
seth vidal
2010-08-06 13:40:54 UTC
Post by Michael Schroeder
Post by seth vidal
the equivalent of: yum install /usr/bin/myprogram
This would work because /usr/bin/ files are in primary.xml.
libzypp doesn't support on-demand download of the filelist;
libsatsolver offers an interface for it.
rhel/fedora has a fair number of non *bin/* filedeps that we have to
look up. And try as I might, I can't seem to make that number go down.
Post by Michael Schroeder
Well, that depends on the usage pattern. As zypp doesn't support
downloading the filelist, we make sure that there are no (Build)Requires
on files outside of the "primary" filepatterns. As /bin and /usr/bin
are used in many packages, we need to download the file information
for those directories anyway, so moving them from primary into separate
files makes things worse for us.
B/c of the extra file to download? B/c the size shouldn't change. With
some of the suggested changes to primary, I'd think the size
should decrease overall.

Did you have any thoughts on the suggestion of breaking summary and
description out into translatable files?
Post by Michael Schroeder
Post by seth vidal
In fact my original thoughts were that repomd.xml doesn't need to
change, except to add an attribute of 'compression type' to each
datatype.
Oh, that's also what I was thinking of. I didn't like the "primary_lzma"
type because it's not different data, but just a different representation.
I agree - but for backward compat that was the only sane way to keep
from breaking all the existing parsers.
Post by Michael Schroeder
<data type="primary">...</data>
<data type="primary" flavor="lzma">...</data>
<data type="primary" flavor="sqlite">...</data>
But I don't know how many repomd parsers would choke on this...
Probably a lot of them would choke on it. Pretty sure yum would pick one
of them at random.


Nitpick but: if I never see 'flavor' as a descriptor of something that
is NOT food, that'll be fine. It's like file 'colors'. That just makes
me want to scream whenever I read it.
Post by Michael Schroeder
Speaking of that lzma patch, I pretty much opposed it because it
conflicts with the "delta download" mechanism I implemented some weeks
ago. The idea is to use 'gzip --rsyncable' for gz compression, add 'zsync'
checksum data to the metalink files and let libzypp download just the
changed blocks with range requests. Works quite nicely for our maintenance
updates; it's probably not very useful for Factory (i.e. "rawhide"), where
the number of rebuilds is quite high.
Where is this patch?

-sv
Michael Schroeder
2010-08-06 14:57:08 UTC
Post by seth vidal
Post by Michael Schroeder
Well, that depends on the usage pattern. As zypp doesn't support
downloading the filelist, we make sure that there are no (Build)Requires
on files outside of the "primary" filepatterns. As /bin and /usr/bin
are used in many packages, we need to download the file information
for those directories anyway, so moving them from primary into separate
files makes things worse for us.
B/c of the extra file to download?
Yes, though with keep-alive connections it probably doesn't matter
much. Also, compression might not work as well if the data is split
into multiple files and each file is compressed separately.
Post by seth vidal
B/c the size shouldn't change. With
some of the suggested changes to primary, I'd think the size
should decrease overall.
Did you have any thoughts on the suggestion of breaking summary and
description out into translatable files?
Not really. Do you mean something like this (xml version):

primary.de.xml.gz:
...
<package pkgid="xxx">
<summary lang="de">Coole Applikation</summary>
<description lang="de">Macht was tolles...</description>
</package>

The sqlite version could be similar.
Post by seth vidal
[...]
Nitpick but: if I never see 'flavor' as a descriptor of something that
is NOT food, that'll be fine. It's like file 'colors'. That just makes
me want to scream whenever I read it.
Heh ;-)
Post by seth vidal
Post by Michael Schroeder
Speaking of that lzma patch, I pretty much opposed it because it
conflicts with the "delta download" mechanism I implemented some weeks
ago. The idea is to use 'gzip --rsyncable' for gz compression, add 'zsync'
checksum data to the metalink files and let libzypp download just the
changed blocks with range requests. Works quite nicely for our maintenance
updates; it's probably not very useful for Factory (i.e. "rawhide"), where
the number of rebuilds is quite high.
Where is this patch?
See my commits in http://gitorious.org/opensuse/libzypp/commits/master,
especially commit c3ba229.

Zsync works by searching local files for blocks with the same checksum
as the target file. As checksum calculation is not a cheap operation,
you can't simply do it for every byte offset in the local files. Thus
you also need a cheap checksum, and you only verify with the real
checksum if the cheap checksum matches.

Metalink already comes with support for multiple checksums, so it
was straightforward to add zsync support. For example, the metalink
for the current 11.3 primary.xml.gz looks like:

...
<hash type="sha256">a712c132725a5a58db3aba53c7dba2cfe61789d8c0deda3591aa0aaa2b55a48a</hash>
<pieces length="131072" type="zsync">
<hash piece="0">324641aa</hash>
<hash piece="1">3b5d83ae</hash>
</pieces>
<pieces length="131072" type="sha1">
<hash piece="0">7f2f37fa19dbf953f4ba4c5f19b62fc5053a42d6</hash>
<hash piece="1">451014563b43ffa75a7a59522beed72a4692b25c</hash>
</pieces>
...

So when libzypp wants to download the new primary file and there
is already an old version, it first looks for blocks in the old file
matching the checksums, and downloads only the blocks that
couldn't be found. (The code also downloads in parallel from multiple
mirrors, but that's more like a wanted side effect ;-))
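As a rough illustration of the two-level check (a made-up helper, not
the libzypp code; adler32 stands in for the cheap checksum, and a real
implementation rolls it in O(1) per byte instead of recomputing it at
every offset):

import hashlib, zlib

BLOCK = 131072  # piece length from the metalink snippet above

def find_reusable_blocks(local_data, pieces):
    # pieces: (cheap_crc, sha1_hex) pairs, one per block of the target file
    by_cheap = {}
    for idx, (cheap, strong) in enumerate(pieces):
        by_cheap.setdefault(cheap, []).append((idx, strong))
    found = {}  # target block index -> offset in the local data
    for off in range(len(local_data) - BLOCK + 1):
        chunk = local_data[off:off + BLOCK]
        cheap = zlib.adler32(chunk) & 0xffffffff
        for idx, strong in by_cheap.get(cheap, ()):
            # only pay for the expensive hash when the cheap one matches
            if idx not in found and hashlib.sha1(chunk).hexdigest() == strong:
                found[idx] = off
    return found

Blocks missing from the result are then fetched with range requests.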

This scheme probably only works with xml (where new packages just
get added to the end of the file, at least for our updates) and
with a --rsyncable compression method.

(With a fresh installation you would suffer from the
suboptimal compression, so it might make sense to offer *both*
primary.xml.gz (or primary.xml) and primary.xml.lzma. Fresh
installations would use the lzma-compressed version, and
systems that have an old primary version would use the .gz
variant that supports delta downloads. Actually, the library
could first check how many blocks match and then use the
optimal method.)

Cheers,
Michael.
--
Michael Schroeder ***@suse.de
SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
seth vidal
2010-08-06 17:02:47 UTC
Post by Michael Schroeder
Yes, though with keep-alive connections it probably doesn't matter
much. Also, compression might not work as well if the data is split
into multiple files and each file is compressed separately.
I think the compression won't be impacted much - if only b/c splitting
the files out by subdir means we can drop the first section entirely:

if we know the file represents /usr/bin

well then we don't need to include that path in any of the entries there.
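For example (a hypothetical chunk format, invented for illustration), a
filelists chunk for /usr/bin could carry only basenames:

<filelist dir="/usr/bin">
  <package pkgid="xxx" name="foo">
    <file>foo</file>
    <file>foo-config</file>
  </package>
</filelist>

since the directory is implied by which chunk the entries live in.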
Post by Michael Schroeder
Post by seth vidal
B/c the size shouldn't change. With
some of the suggested changes to primary, I'd think the size
should decrease overall.
Did you have any thoughts on the suggestion of breaking summary and
description out into translatable files?
...
<package pkgid="xxx">
<summary lang="de">Coole Applikation</summary>
<description lang="de">Macht was tolles...</description>
</package>
The sqlite version could be similar.
More or less - but again - using the name/location/label of the file to
determine the language, so you don't have a bunch of duplicate info.

If the file is named trans.de_DE.xml.gz then we don't really need to
label summary and description with lang='de', do we?
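A tiny sketch of deriving the locale from the file name, assuming that
naming convention:

import re

def locale_from_name(fname):
    # trans.de.xml.gz -> 'de', trans.de_DE.xml.gz -> 'de_DE'
    m = re.match(r'trans\.([A-Za-z]{2}(?:_[A-Za-z]{2})?)\.xml(?:\.gz)?$', fname)
    return m.group(1) if m else None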
Post by Michael Schroeder
Post by seth vidal
Post by Michael Schroeder
Speaking of that lzma patch, I pretty much opposed it because it
conflicts with the "delta download" mechanism I implemented some weeks
ago. The idea is to use 'gzip --rsyncable' for gz compression, add 'zsync'
checksum data to the metalink files and let libzypp download just the
changed blocks with range requests. Works quite nicely for our maintenance
updates; it's probably not very useful for Factory (i.e. "rawhide"), where
the number of rebuilds is quite high.
Where is this patch?
See my commits in http://gitorious.org/opensuse/libzypp/commits/master,
especially commit c3ba229.
Zsync works by searching local files for blocks with the same checksum
as the target file. As checksum calculation is not a cheap operation,
you can't simply do it for every byte offset in the local files. Thus
you also need a cheap checksum, and you only verify with the real
checksum if the cheap checksum matches.
Ah - now I recall - the need for the hashing and keeping around older
revisions of the metadata is what made zsync less palatable.

Since each pkg is an individual 'chunk' of the data that makes up the
whole of the repodata, one thought was generating both a
complete copy of the repodata and a discrete chunk of the metadata
per-pkg.

So if I look up the pkglist and see that the changeset from the last
time I got the metadata was the addition/update of 100 pkgs and the
removal of 20 and downloading the metadata for those 100 pkgs is smaller
than downloading the whole thing, then I could just do that and create
the new metadata on my own.

It's a slightly more coarse-grained delta'ing of the metadata but in
discrete chunks that make sense to the user, not just to the parser.
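A rough sketch of that decision, assuming a hypothetical index that
lists a chunk size per pkgid:

def plan_metadata_fetch(old_index, new_index, full_size):
    # old_index/new_index: {pkgid: chunk size in bytes}
    added = [p for p in new_index if p not in old_index]
    removed = [p for p in old_index if p not in new_index]
    if sum(new_index[p] for p in added) < full_size:
        # fetch just the changed chunks and rebuild the repodata locally
        return ('per-pkg', added, removed)
    return ('full', [], [])  # cheaper to grab the whole blob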
Post by Michael Schroeder
This scheme probably only works with xml (where new packages just
get added to the end of the file, at least for our updates) and
with a --rsyncable compression method.
you add the content to the end of the existing metadata? Does that mean
your metadata grows w/o bound over the duration of a release?
Post by Michael Schroeder
(With a fresh installation you would suffer from the
suboptimal compression, so it might make sense to offer *both*
primary.xml.gz (or primary.xml) and primary.xml.lzma. Fresh
installations would use the lzma-compressed version, and
systems that have an old primary version would use the .gz
variant that supports delta downloads. Actually, the library
could first check how many blocks match and then use the
optimal method.)
That's what I was suggesting by having per-pkg chunks available AS well
as complete sets.

-sv
Michael Schroeder
2010-08-06 17:15:57 UTC
Post by seth vidal
More or less - but again - using the name/location/label of the file to
determine the language, so you don't have a bunch of duplicate info.
If the file is named trans.de_DE.xml.gz then we don't really need to
label summary and description with lang='de', do we?
Right.
Post by seth vidal
Post by Michael Schroeder
Zsync works by searching local files for blocks with the same checksum
as the target file. As checksum calculation is not a cheap operation,
you can't simply do it for every byte offset in the local files. Thus
you also need a cheap checksum, and you only verify with the real
checksum if the cheap checksum matches.
Ah - now I recall - the need for the hashing and keeping around older
revisions of the metadata is what made zsync less palatable.
Since each pkg is an individual 'chunk' of the data that makes up the
whole of the repodata, one thought was generating both a
complete copy of the repodata and a discrete chunk of the metadata
per-pkg.
So if I look up the pkglist and see that the changeset from the last
time I got the metadata was the addition/update of 100 pkgs and the
removal of 20 and downloading the metadata for those 100 pkgs is smaller
than downloading the whole thing, then I could just do that and create
the new metadata on my own.
Yes, that would be possible. It depends on how well each per-pkg chunk
compresses, I guess.
The advantage of having all chunks in one file with delta downloads is
that it doesn't clutter up the repo directory so much.
Post by seth vidal
Post by Michael Schroeder
This scheme probably only works with xml (where new packages just
get added to the end of the file, at least for our updates) and
with a --rsyncable compression method.
you add the content to the end of the existing metadata? Does that mean
your metadata grows w/o bound over the duration of a release?
If rpms get deleted they also vanish from the metadata. But we tend
to keep the old rpms for reference, so usually only the number of packages
at the start of the xml data changes and new packages get appended.

Cheers,
Michael.
--
Michael Schroeder ***@suse.de
SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
seth vidal
2010-08-06 18:24:47 UTC
Post by Michael Schroeder
Right.
is this an issue for opensuse? I know it's one for fedora b/c of wanting
folks to be able to search for pkgs based on summary/description info in
their own language.
Post by Michael Schroeder
Yes, that would be possible. It depends on how well each per-pkg chunk
compresses, I guess.
The advantage of having all chunks in one file with delta downloads is
that it doesn't clutter up the repo directory so much.
Agreed - the thought was to have them all in a subdir with an index, so
you could fetch them and use the same gpg/x509 verifier for them.


If you remove all the crazy you can make the full pkg metadata pretty
small.
Post by Michael Schroeder
If rpms get deleted they also vanish from the metadata. But we tend
to keep the old rpms for reference, so usually only the number of packages
at the start of the xml data changes and new packages get appended.
yah - the above is where we got into trouble in fedora - b/c the sheer
volume of updates would be a bit crushing if we tried to tack them on
the end of the xml. This is part of the reason for looking at
pure-sqlite repos, so we could do all updates as just inserts/deletes +
repomd.xml regeneration.
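A minimal sketch of such a per-repo update, assuming a simplified
packages table (yum's real primary.sqlite has more columns and
auxiliary tables):

import sqlite3

def apply_changeset(db_path, removed_pkgids, added_rows):
    # added_rows: (pkgId, name, version) tuples for the new/updated pkgs
    con = sqlite3.connect(db_path)
    with con:  # one transaction for the whole changeset
        con.executemany("DELETE FROM packages WHERE pkgId = ?",
                        ((p,) for p in removed_pkgids))
        con.executemany(
            "INSERT INTO packages (pkgId, name, version) VALUES (?, ?, ?)",
            added_rows)
    con.execute("VACUUM")  # rewrite the file now and then to fight fragmentation
    con.close()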


-sv
Michael Schroeder
2010-08-07 15:43:38 UTC
Post by seth vidal
is this an issue for opensuse? I know it's one for fedora b/c of wanting
folks to be able to search for pkgs based on summary/description info in
their own language.
Yes, it's an issue for the update repositories. It's not an issue
for the DVDs, as they still use the old "susetags" format, which
has had the translations in separate files for many years.
Post by seth vidal
Post by Michael Schroeder
If rpms get deleted they also vanish from the metadata. But we tend
to keep the old rpms for reference, so usually only the number of packages
at the start of the xml data changes and new packages get appended.
yah - the above is where we got into trouble in fedora - b/c the sheer
volume of updates would be a bit crushing if we tried to tack them on
the end of the xml. This is part of the reason for looking at
pure-sqlite repos, so we could do all updates as just inserts/deletes +
repomd.xml regeneration.
Hmm, I thought those sqlite databases were generated from
scratch. Don't you suffer from fragmentation issues if you
just do inserts/deletes? Our libzypp folks experimented with a
single sqlite database that contains the data of all repositories,
and while it was fast in the beginning it got slower and slower
the more the database was changed. (That may be fixed in current
sqlite versions, though. Or maybe I remember it wrong.)

Cheers,
Michael.
--
Michael Schroeder ***@suse.de
SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
seth vidal
2010-08-09 18:02:01 UTC
Post by Michael Schroeder
Hmm, I thought those sqlite databases were generated from
scratch. Don't you suffer from fragmentation issues if you
just do inserts/deletes? Our libzypp folks experimented with a
single sqlite database that contains the data of all repositories,
and while it was fast in the beginning it got slower and slower
the more the database was changed. (That may be fixed in current
sqlite versions, though. Or maybe I remember it wrong.)
Sorry if I was unclear. I'm not talking about merging across multiple
repos. I just mean, per repo, writing directly to the sqlite instead of
going through the intermediary step of the xml. Even if we end up
rewriting the sqlite files to avoid fragmentation issues, accessing and
dumping out pkg objects from the sqlite is faster than from the xml -
especially for random access.

It's also lighter in terms of memory.

-sv
Duncan Mac-Vicar P.
2010-08-13 14:16:36 UTC
Post by seth vidal
is this an issue for opensuse? I know it's one for fedora b/c of wanting
folks to be able to search for pkgs based on summary/description info in
their own language.
IIRC we parse all languages, and when doing queries we use the current
locale, with fallback.

Duncan

Michael Schroeder
2010-08-07 15:51:38 UTC
Post by seth vidal
Post by Michael Schroeder
Post by seth vidal
the equivalent of: yum install /usr/bin/myprogram
This would work because /usr/bin/ files are in primary.xml.
libzypp doesn't support on-demand download of the filelist;
libsatsolver offers an interface for it.
rhel/fedora has a fair number of non *bin/* filedeps that we have to
look up. And try as I might, I can't seem to make that number go down.
Hmm, but *bin/* dependencies are already in primary, right? (At
least for the xml version.) So that doesn't hurt.

AFAIK you also have lots of dependencies on stuff in /usr/lib;
that's why you need to transfer the complete filelist.

Would it make sense to find out which directories are used in
the dependencies and also add them to the primary data like
the *bin/* files?
The idea is that you have to download that part of the filelist
anyway if you want to do depsolving (i.e. for yum install/update).

You would need to store the used glob patterns somewhere in the
primary file, so that yum knows when the primary data is sufficient
and when it has to download the complete file list.
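Something like this hypothetical element (name and placement invented
for illustration):

<filelist-patterns>
  <pattern>*bin/*</pattern>
  <pattern>/etc/*</pattern>
</filelist-patterns>

If the path a client needs matches one of the patterns, primary is
authoritative; otherwise it has to fetch the full filelist.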

Cheers,
Michael.
--
Michael Schroeder ***@suse.de
SUSE LINUX Products GmbH, GF Markus Rex, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
seth vidal
2010-08-09 18:04:37 UTC
Post by Michael Schroeder
Hmm, but *bin/* dependencies are already in primary, right? (At
least for the xml version.) So that doesn't hurt.
AFAIK you also have lots of dependencies on stuff in /usr/lib;
that's why you need to transfer the complete filelist.
Actually, ours are worse than that:

*bin/*, of course
/etc, of course
/usr/lib/
/usr/share <- this is where the real annoyances come from - this is
mostly font requirements.
Post by Michael Schroeder
Would it make sense to find out which directories are used in
the dependencies and also add them to the primary data like
the *bin/* files?
It varies from release to release - but in general the ones
in /usr/share would make adding them to primary untenable.
Post by Michael Schroeder
The idea is that you have to download that part of the filelist
anyway if you want to do depsolving (i.e. for yum install/update).
This is why I was pushing for per-subdir sets of filelists mapped back
to pkgs. So someone looking for /usr/lib/something can just download a
chunk for /usr/lib/* rather than ALL of the filelists for that
repo.

-sv
Klaus Kaempf
2010-08-06 12:55:57 UTC
Post by seth vidal
When you think about it - the xml format is a cache, too. It's a
compilation, truncation and compression of the data in the rpm hdrs.
Actually, it serves two purposes (requirements):

1. caching of rpm header information
2. providing a universally understood format ("XML's design goals
emphasize simplicity, generality, and usability over the Internet",
according to http://en.wikipedia.org/wiki/XML)

I'd like to add a third requirement

3. minimize download time required by clients to update their knowledge
about a repository

Having read through the archive of this discussion, the thread started
with a proposal to improve #3, namely adding a better compression
scheme.


Are there other or different requirements for a repository metadata
scheme ?

Klaus
---
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
seth vidal
2010-08-06 14:11:40 UTC
Post by Klaus Kaempf
Post by seth vidal
When you think about it - the xml format is a cache, too. It's a
compilation, truncation and compression of the data in the rpm hdrs.
Actually, it serves two purposes (requirements):
1. caching of rpm header information
2. providing a universally understood format ("XML's design goals
emphasize simplicity, generality, and usability over the Internet",
according to http://en.wikipedia.org/wiki/XML)
I'd like to add a third requirement
3. minimize download time required by clients to update their knowledge
about a repository
Having read through the archive of this discussion, the thread started
with a proposal to improve #3, namely adding a better compression
scheme.
Are there other or different requirements for a repository metadata
scheme ?
I think a few more things are 'important to have' but not necessarily
'requirements':

1. make it relatively easy to provide translations of key pieces of
metadata - in this case summary and description

2. make it possible for the list of files in all pkgs to explode
exponentially w/o crippling the ability to depsolve pkgs

3. deal more gracefully with... umm.... poor packaging choices and
deps/provides generators that are a bit too excitable - we had an issue
in fedora with the erlang pkgs where the deps/provides went from about
30-40 per pkg to 20000 per pkg. That, obviously, made the metadata
balloon in size. We were able to shut down that particular nightmare but
it won't be the last one. Kernel interfaces are going to do this
eventually.

4. retain the ability to gpg/x509/etc sign the top layer (repomd.xml)
and use hashing to verify everything else the rest of the way down.


#4 there is moot - but it is important for security reasons.

-sv
James Antill
2010-08-06 14:53:48 UTC
Post by Klaus Kaempf
Post by seth vidal
When you think about it - the xml format is a cache, too. It's a
compilation, truncation and compression of the data in the rpm hdrs.
Actually, it serves two purposes (requirements):
1. caching of rpm header information
2. providing a universally understood format ("XML's design goals
emphasize simplicity, generality, and usability over the Internet",
according to http://en.wikipedia.org/wiki/XML)
XML is no more "universal" than sqlite, or a few other formats.
Post by Klaus Kaempf
I'd like to add a third requirement
3. minimize download time required by clients to update their knowledge
about a repository
That's fair, but there are several points to this:

i. How big are the metadata file sizes.

ii. How much of it you have to download for several use cases (install,
search, provides).

iii. How much of it changes, as packages are added. Some kind of delta
could help here, as could changing the format (a bit, or a lot).

iv. How much work the client has to do post download.

v. How much work the client has to do when the metadata changes.

vi. How big "repomd.xml" becomes.

...because users see the sum of all of this as "download time", in most
cases.
Adding ".xz" compression trades #i for #vi, and is probably worth it.
Using XML trades #i for #iv, #v and #vi ... and IMNSHO, isn't worth it.

The start of this thread, is due to #ii currently being far from
optimal for some cases.
Daniel Veillard
2010-08-06 14:57:52 UTC
Post by James Antill
Post by Klaus Kaempf
2. providing a universally understood format ("XML's design goals
emphasize simplicity, generality, and usability over the Internet",
according to http://en.wikipedia.org/wiki/XML)
XML is no more "universal" than sqlite, or a few other formats.
Well, I guess most of the IT industry would disagree with you
on that statement, but having a diverging opinion is fine - it's just a
bad argument in general.

Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
***@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
seth vidal
2010-08-06 14:13:06 UTC
Post by seth vidal
They shouldn't need to change at all.
In fact my original thoughts were that repomd.xml doesn't need to
change, except to add an attribute of 'compression type' to each
datatype.
One more item that had occurred to me for repomd.xml - it would be nice
to be able to, optionally, have an expiration on the metadata, so that a
client could know when the data SHOULD have been updated and can be
sure it is not being intentionally shown out-of-date information by a
rogue server.

-sv
Duncan Mac-Vicar P.
2010-08-13 12:21:45 UTC
Post by seth vidal
One more item that had occurred to me for repomd.xml - it would be nice
to be able to, optionally, have an expiration on the metadata, so that a
client could know when the data SHOULD have been updated and can be
sure it is not being intentionally shown out-of-date information by a
rogue server.
-sv
We already had a conf call about that one and agreed on that tag.

SUSE is already using it, in our repomd.xml extension file called
suseinfo.xml:

curl http://download.opensuse.org/update/11.3/repodata/suseinfo.xml

<?xml version="1.0" encoding="UTF-8"?>
<suseinfo xmlns="http://linux.duke.edu/metadata/repo">
<expire>2592000</expire>
</suseinfo>
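(The expire value is an interval in seconds: 2592000 s = 30 days.)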

We use the same parser for repomd.xml and suseinfo.xml. That means that
if you put that tag in repomd.xml next to the "tags" tag, we remain
compatible. We just kept it in suseinfo.xml so as not to break
compatibility with yum.

Duncan
Panu Matilainen
2010-08-12 09:48:26 UTC
Post by seth vidal
Post by Duncan Mac-Vicar P.
I was thinking about the sqlite part, and I think the change has to go
in a way that
* assumes sqlite3 is _yum_'s cache
why? Why not just a random-access md?
/me wakes up from hibernation to brush a 10cm layer of dust off my apt-rpm hat

Because yum is the only depsolver that can use it directly. Everything
else uses its own internal cache format, requiring yet another
conversion of the data. And for a one-time read-through + convert of the
data, a sequential read of xml can actually be faster than random-access
reads from sqlite:

On my laptop, apt-rpm with F13 default repositories, generating the
internal cache from scratch takes ~35s with XML files. With the sqlite DB
files, it takes ~46s. From what I read here, zypp and smart share similar
experiences (whatever the exact numbers are I've no idea).

In the case of apt-rpm, sqlite /is/ a huge win over XML for the operations
(eg file searches) where the data isn't stored in the internal cache but
has to be looked up from the repository files. With XML those operations
are simply pathological, similar to the smartpm case of 30min vs 30s vs 3s.

If we're talking about redesigning repomd, it'd be a huge mistake not to
at least attempt to address all the (now) known problems of the initial
design + the extensions it has grown over time, and one of them is: what's
good for yum can hurt others, because they operate in wildly different
ways. And mind you, I'm not pointing any fingers, as I'd be as guilty as
anyone who's been around since the initial repomd spec discussions and
either didn't see the issues coming or didn't speak up / submit code.

- Panu -
Duncan Mac-Vicar P.
2010-08-13 12:25:18 UTC
Post by Panu Matilainen
/me wakes up from hibernation to brush a 10cm layer of dust off my apt-rpm hat
Because yum is the only depsolver that can use it directly. Everything
else uses its own internal cache format, requiring yet another
conversion of the data. And for a one-time read-through + convert of the
data, a sequential read of xml can actually be faster than random-access
reads from sqlite:
On my laptop, apt-rpm with F13 default repositories, generating the
internal cache from scratch takes ~35s with XML files. With the sqlite
DB files, it takes ~46s. From what I read here, zypp and smart share
similar experiences (whatever the exact numbers are I've no idea).
In the case of apt-rpm, sqlite /is/ a huge win over XML for the operations
(eg file searches) where the data isn't stored in the internal cache
but has to be looked up from the repository files. With XML those
operations are simply pathological, similar to the smartpm case of
30min vs 30s vs 3s.
If we're talking about redesigning repomd, it'd be a huge mistake not
to at least attempt to address all the (now) known problems of the
initial design + the extensions it has grown over time, and one of them
is: what's good for yum can hurt others, because they operate in
wildly different ways. And mind you, I'm not pointing any fingers, as
I'd be as guilty as anyone who's been around since the initial repomd
spec discussions and either didn't see the issues coming or didn't
speak up / submit code.
You described exactly what I think on the issue.

We use xml only as the starting point. But just as yum can solve over
sqlite, our solver operates over solv structures directly, and we can't
move away from that. We could improve the starting point; however,
basing it on sqlite only benefits yum.

Duncan