
When working with fail2ban, I noticed that some operations were really slow when reading from the journal instead of plain text log files. This is not really noticeable with the default daemon operation of fail2ban, but it is with its test tool, fail2ban-regex. The main point is that currently, all field values are converted, which is completely unneeded for a lot of use cases. Moreover, the use of a ChainMap seems superfluous, yet it is extremely costly, especially when no custom converters are provided. I've implemented a test tool to monitor some possible improvements. Here are the raw results:

On the X axis, you can see patch sets, grouped by loads, which are described further below. On the Y axis are numbers of cycles, as reported by perf (from the Linux kernel). The tests were performed on two setups: the chart above is from the Tumbleweed desktop; the chart made using the Core2 has strictly the same ratios, only the numbers are a bit higher.
Loads
These are operations to be done on one journal entry. There are currently 4 of them:
_SYSTEMD_UNIT and MESSAGE.

Patch sets
init_reader_nochainmap (aka nochainmap)
Currently, the Reader class uses a ChainMap to merge default converters and user-provided ones. This is only done if the Python version is greater than or equal to 3.3. I noticed that the time spent in the get() method is really high. It seems that ChainMap is good for merging either big dictionaries (where a call to update() is costly) or dynamic dictionaries (where an update to one dictionary is reflected in the ChainMap instance). Reader is in neither of these 2 cases, so this patch completely removes the use of ChainMap, copies the default converters dict, and updates it with custom converters if needed. This is by far the easiest, least intrusive, and most efficient patch in all cases.
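The replacement is small enough to sketch. The converter names below are illustrative, not the actual defaults used by the Reader class; the point is the lookup path, which becomes a plain dict get instead of a ChainMap walk:

```python
from collections import ChainMap

# Illustrative defaults; the real Reader ships its own converter table.
DEFAULT_CONVERTERS = {
    "MESSAGE": bytes.decode,
    "PRIORITY": int,
}

def make_converters_chainmap(custom=None):
    """ChainMap approach: every lookup walks custom first, then defaults."""
    return ChainMap(custom or {}, DEFAULT_CONVERTERS)

def make_converters_flat(custom=None):
    """Patched approach: pay a one-time copy/update, then lookups are
    plain dict accesses. The shared defaults dict is never mutated."""
    converters = DEFAULT_CONVERTERS.copy()
    if custom:
        converters.update(custom)
    return converters
```

Both give the same mapping; the flat dict just moves the merge cost to construction time, which is paid once per Reader instead of on every field access.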
convert_entry_lazy
This is the first step of a convert-on-demand approach. Instead of converting all field values in _convert_entry, we just return a MutableMapping with all already-fetched values as bytes, and convert values only on request. Converted values are not recorded, since we assume the caller will not try to fetch the same field twice. get_next() is untouched, so special fields (__CURSOR, __MONOTONIC_TIMESTAMP, __REALTIME_TIMESTAMP) are already fetched, but not converted. This is a good approach, but there is nearly always a better way to do it, except for the read_all_fields corner case.
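A minimal sketch of such a mapping, assuming the fields have already been fetched as bytes (class and variable names here are mine, not the patch's):

```python
from collections.abc import MutableMapping

class LazyEntry(MutableMapping):
    """Hold raw journal field values and convert only on access.

    Converted results are deliberately not cached: the assumption is
    that a caller reads each field at most once.
    """

    def __init__(self, raw, converters):
        self._raw = dict(raw)          # field name -> bytes, already fetched
        self._converters = converters  # field name -> callable

    def __getitem__(self, key):
        value = self._raw[key]         # KeyError propagates as usual
        convert = self._converters.get(key)
        return convert(value) if convert else value

    def __setitem__(self, key, value):
        self._raw[key] = value

    def __delitem__(self, key):
        del self._raw[key]

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)
```

Fields without a converter come back as raw bytes; only the fields the caller actually asks for pay the conversion cost.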
get_next_lazy (aka next_lazy)
This is an evolution of convert_entry_lazy that modifies get_next() itself. By default, there is no call to _get_all(), and only an instance of Mapping is returned. Special fields are not even computed. When requested, a field is fetched using _get() and then _convert_value(). The result is not recorded, for the same reason as before.

To be able to iterate on this Mapping, the first call to __iter__ or __len__ will fetch all fields using _get_all(), add the special field names, and record the result. A __bool__ method is also provided to avoid a costly call to __len__() just to test whether the current entry is valid (it always is). This is the best approach for cases where only a small number of fields is read, as in the empty and digest loads.
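The shape of that fully lazy entry could look like this. The `get_one`/`get_all` callables stand in for the journal's _get()/_get_all(); the names and the conversion hook are illustrative assumptions, not the patch's actual code:

```python
from collections.abc import Mapping

class LazyJournalEntry(Mapping):
    """Fetch journal fields only when asked (sketch of the get_next_lazy idea)."""

    def __init__(self, get_one, get_all, convert):
        self._get_one = get_one    # stand-in for _get(): one field, first value
        self._get_all = get_all    # stand-in for _get_all(): all (field, value) pairs
        self._convert = convert    # stand-in for _convert_value()
        self._fields = None        # populated on first iteration only

    def __getitem__(self, key):
        # Single-field path: fetch and convert on demand, record nothing.
        return self._convert(key, self._get_one(key))

    def _materialize(self):
        # First __iter__/__len__ fetches everything once and records it.
        if self._fields is None:
            self._fields = dict(self._get_all())
        return self._fields

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())

    def __bool__(self):
        # The current entry is always valid; avoid the costly __len__ path.
        return True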
get_next_lazy_prefetched (aka next_lazy_prefetched)
This is a mix between convert_entry_lazy and get_next_lazy, to show the effect of just calling _get_all() without creating special entries. Values are only converted on request, and special fields are fetched and converted on request. This is usually a better approach than convert_entry_lazy, except in read_all_fields, which I can't explain. As we can see in the digest load, the difference between this and get_next_lazy is minimal, so we can assume that calling _get_all() has a negligible impact, except for the empty corner case.

convert_value_nolist (immature)
This patch is not visible in the chart.
The goal is to remove the isinstance() test when converting values, since the vast majority of values are not lists. The impact is not negligible, but I found some logs where values returned by _get_all are lists. The journal format has no such list type, but a field name can be duplicated, so _get_all gathers the duplicates into lists. In my tests, I found 3 messages with such cases, using the journal from openSUSE Tumbleweed, with the SYSLOG_IDENTIFIER field. On Debian/testing, I was not able to find this case.

Analysis
- Disable ChainMap usage in all cases: this is always a win. I will open a merge request for this soon.
- Use convert-on-demand: in nearly all cases, it is more interesting to call _get_all() without converting anything.
- Important caveat for get_next_lazy: on request, it uses calls to _get(). If a field is duplicated, _get only returns the first value, while _get_all will gather all values in a list. This is reported in systemd/systemd#9696.

As a result of this, I would like to write a patch for get_next() that does the same job as get_next_lazy_prefetched by default, with an optional boolean (defaulting to True) that controls whether get_all is called. When set to False, it will behave like get_next_lazy, with the important caveat mentioned before, but for small loads it will be faster.

Conclusion
I would like to know if you have any thoughts about this:
You can use run_all.sh in my own tool, which will write a report_aggregate.csv that is easy to turn into a chart using LibreOffice or any other charting tool of your choice.
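If you'd rather chart it programmatically, something like the following works. The column names (`patchset`, `load`, `cycles`) are an assumption about the CSV layout, not its documented format, so adjust to whatever report_aggregate.csv actually contains:

```python
import csv
from collections import defaultdict

def summarize(lines):
    """Group cycle counts by load so each load becomes one chart series.

    `lines` is any iterable of CSV text lines (e.g. an open file).
    Assumes one row per (patchset, load) pair with a `cycles` column;
    this layout is a guess at the report_aggregate.csv format.
    """
    by_load = defaultdict(dict)
    for row in csv.DictReader(lines):
        by_load[row["load"]][row["patchset"]] = int(row["cycles"])
    return dict(by_load)
```

The resulting dict-of-dicts maps directly onto grouped bar charts in matplotlib or a spreadsheet pivot.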