Category: Encounter

  • ArchiveBox: Open-source self-hosted web archiving.

    https://github.com/ArchiveBox/ArchiveBox

    ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline.

    You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows.

    You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list.

    It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list.

    The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down.

  • How Technology Architects make decisions

    A thoughtful treatment of why technology decision making needs to be documented, and of how hard changes to culture and mindset are.

    • Compensatory. This type of decision considers every alternative, analysing all criteria in low-level detail. Criteria with different scores can compensate for each other, hence the name. There are two types here:
      • Compensatory Equal Weight – criteria are scored and totalled for each potential option; the option with the highest total signifies the best decision.
      • Compensatory Weighted Additive (WADD) – here a weighting is given to each criterion to reflect its significance (the higher the weighting, the higher the significance). The weighting is multiplied by the score for each criterion, the products are summed for each alternative, and the highest total wins.
    • Non-Compensatory. This method uses fewer criteria. The two types are:
      • Non-Compensatory Conjunctive – alternatives that cannot meet a criterion are immediately dismissed; the winner is chosen among the survivors.
      • Non-Compensatory Disjunctive – an alternative is chosen if it complies with a criterion, irrespective of other criteria.

    Compensatory decisions are suitable when time and resources are available to:

    • gather the right set of alternatives,
    • evaluate each alternative in detail, and
    • score each with consistency and precision.

    Non-Compensatory decisions are necessary when

    • there is time stress
    • the problem is not well structured  
    • the information surrounding the problem is incomplete
    • criteria can't be expressed numerically
    • there are competing goals
    • the stakes are high
    • there are multiple parties negotiating in the decision. 
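    The weighted additive (WADD) scoring above is easy to make concrete. A minimal Python sketch; the criteria, weights, scores, and option names are invented for illustration, not taken from the article:

```python
# Hypothetical criteria weights (higher = more significant) and
# per-option scores -- illustrative numbers only.
weights = {"cost": 3, "maturity": 2, "team_skills": 1}

options = {
    "Postgres": {"cost": 8, "maturity": 9, "team_skills": 6},
    "MongoDB":  {"cost": 7, "maturity": 7, "team_skills": 8},
}

def wadd_score(scores, weights):
    # WADD: multiply each criterion's score by its weight, then sum.
    return sum(weights[c] * s for c, s in scores.items())

totals = {name: wadd_score(s, weights) for name, s in options.items()}
best = max(totals, key=totals.get)  # highest weighted total wins
```

    Dropping the weights (treating them all as 1) turns the same sketch into the Compensatory Equal Weight method.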
  • ribosome (A simple generic code generation tool)

    Source: ribosome

  • Joplin (open source note taking app)

    Joplin
    An open source note taking and to-do application with synchronisation capabilities

    https://joplinapp.org/

  • The pedantic checklist for changing your data model in a web application

    https://rtpg.co/2021/06/07/changes-checklist.html

    Quote:

    Let’s say you have a web app with some database. In your database you have an Invoice model, where you store things you are charging your customers.

    Your billing system is flexible, so you support setting up tax inclusive or exclusive invoices, as well as tax free invoices! In this model you store it as:

        class Invoice:
            is_taxable: bool
            is_tax_inclusive: bool
    You use this for a couple of years and write a bunch of instances to your database. Then one day you wake up and decide that you no longer like this model! It allows for representing invalid state (what would be a non-taxable, but tax-inclusive invoice?). So you decide that you want to go for a new data model:

        import enum

        class TaxType(enum.Enum):
            no_tax = enum.auto()
            tax_inclusive = enum.auto()
            tax_exclusive = enum.auto()

        class Invoice:
            tax_type: TaxType
    You write up this new code, but there are now a couple of problems:

    • Your database’s data is all in the old shape! You’ll need to move all your data over to this new data model.
    • You’re a highly available web app! You’re not gonna do any downtime (well, planned, anyways) unless you can’t avoid it.
    • Your system is spread across multiple machines (high availability, right?), so you have to deal with multiple versions of your backend code running at the same time during a deployment.

    There’s a whole list of steps you have to do in order to roll out this change. There’s a whole thing about “double writing”, “backfilling”, etc. But there are actually a lot of steps when you end up needing to make a backwards-incompatible change!

    I feel like I know this list, but every once in a while I end up missing one step when I go off the beaten path, so here’s the list, with every little step required, in all the pedantry.

    An important detail here is that you need to roll out each version one-by-one. You can have some parts of your system on Version 3 and others on Version 4. But if you have some of your system on Version 3 and others on Version 5, you will hit issues with data stability. This makes for a lot of steps! Depending on your data model, you might be able to fuse multiple versions into one (especially when you have a flexible system).

    Version 0: Read/Write Old Representation
    The initial version of your application is using the old model, of course. This is your starting point, and it might not actually be ready for us to start introducing a new model (especially if your application <-> DB layer is particularly strict about what it receives)

    Version 1: Can Accept The New Representation
    This version will be able to read your data and not blow up if the new representation is present. This doesn’t mean you are using the new representation for anything! Just that you can handle it.

    A lot of systems don’t actually require this as a distinct step. You can add a new column to a database and have existing queries continue to work just fine. But there are a couple of places where you need to be more careful. Some examples:

    • Adding a new value to an enumeration. If I only have tax_inclusive and tax_exclusive, I need to put the no_tax-handling code in place before I start migrating data over to it (or having new rows use it).
    • Systems with strict validation. A system might have error paths that trigger when a new key starts appearing in some JSON dictionary, so you might need to add preparation code for this.
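    Making Version 1 tolerant can be as small as a lenient parser on the read path. A sketch in Python against the post's TaxType enum (the parse_tax_type helper is my assumption, not the author's code):

```python
import enum

class TaxType(enum.Enum):
    no_tax = enum.auto()
    tax_inclusive = enum.auto()
    tax_exclusive = enum.auto()

def parse_tax_type(raw):
    """Version 1: tolerate the new column without using it yet.

    Returns None when the column is absent or holds a value this
    version does not know about, instead of raising."""
    if raw is None:
        return None
    try:
        return TaxType[raw]   # look up by name, e.g. "no_tax"
    except KeyError:
        return None           # unknown future value: ignore, don't blow up
```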
    Migration 1: Add The New Representation To The Database
    For an SQL database, this usually is about adding a new column to your database. Some databases might not need this step, and some data model changes might not need this (for example if you are just adding a new value into an enumeration, but the underlying data was stored as a string)

    Version 2: Write To Old + New Representation
    This version of your application will start filling in both representations on writes to your database. You still continue to read from the old representation (so that writes that happened with V1 of the application still make sense), and writing to the old representation means that during your V1 -> V2 deploy, V1 reads don’t get stale.
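    The double-write step might look like the following Python sketch, with rows as plain dicts and a hypothetical tax_type_from_flags mapping (assumed names, not the post's code):

```python
import enum

class TaxType(enum.Enum):
    no_tax = enum.auto()
    tax_inclusive = enum.auto()
    tax_exclusive = enum.auto()

def tax_type_from_flags(is_taxable, is_tax_inclusive):
    # Derive the new enum value from the old boolean pair.
    if not is_taxable:
        return TaxType.no_tax
    return TaxType.tax_inclusive if is_tax_inclusive else TaxType.tax_exclusive

def write_invoice(row, is_taxable, is_tax_inclusive):
    """Version 2: fill in BOTH representations on every write."""
    row["is_taxable"] = is_taxable             # old columns: V1 readers still need these
    row["is_tax_inclusive"] = is_tax_inclusive
    row["tax_type"] = tax_type_from_flags(is_taxable, is_tax_inclusive).name
    return row
```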

    Migration 2: Backfill The New Representation In Existing Data
    For any rows that haven’t been written to since you rolled out Version 2, you won’t have filled in the new representation of your data (maybe a user just hasn’t logged in for a while!). In order to make sure the new representation is ready to be read, this migration should go through all existing data and fill in the new representation.

    You need to do this after Version 2 is deployed, because if a Version 1 write happens during the migration, then the backfilled value will actually be stale. And you need to do this migration before you begin reading the new representation, so that old records can be properly read.
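    The backfill itself can be a single pass over existing rows that fills tax_type only where it is still missing. A sketch under the same assumed dict-rows representation:

```python
def backfill_tax_type(rows):
    """Migration 2: fill in tax_type only where it is still missing.

    Rows written since Version 2 already carry tax_type; skipping them
    means the backfill cannot clobber a fresher concurrent write."""
    for row in rows:
        if row.get("tax_type") is None:
            if not row["is_taxable"]:
                row["tax_type"] = "no_tax"
            elif row["is_tax_inclusive"]:
                row["tax_type"] = "tax_inclusive"
            else:
                row["tax_type"] = "tax_exclusive"
    return rows
```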

    Version 3: Read From The New Representation
    You have now filled out the new field, so you can read from it! However, you still need to be writing to both representations. Why? Because Version 2 of your application is still reading from the old field! During a deployment you’ll still have machines on previous versions, so you need to be compatible with coexisting, at least for the duration of a deploy.

    Migration 3: Remove Any Mandatory Constraints For Old Representation
    This is sometimes not needed, but if you are removing an old field that was once required, you’ll want to remove those constraints at this point. If you don’t do this, then once a version of the app is deployed which removes references to the old version, you will likely hit database constraint failures or the like.

    Version 4: Read/Write From The New Representation Only, Remove References To Old Representation
    At this point, the previous version was only writing to the old representation for backwards-compatibility reasons. So you can stop writing to the old representation, and have all read/write paths just hit the new one.

    At this point you also want to remove references to the old representation (in particular stuff like model fields), in preparation for the final migration.

    Migration 4: Drop The Old Representation Entirely
    Once you have version 4 out, queries should no longer be referencing the old representation at all. You should be good to go for just dropping this stuff entirely!

    You’ve gotta be sure about this one, though; it’s really hard to roll back this change.

    Once you’ve done that you’re good to close out that work!

    The Checklist

    1. Deploy Version 1 (Accept New Representation)
    2. Add The New Representation To The Database
    3. Deploy Version 2 (Read Old / Write Old + New)
    4. Backfill The New Representation In Existing Data
    5. Deploy Version 3 (Read New / Write Old + New)
    6. Remove Any Mandatory Constraints For The Old Representation
    7. Deploy Version 4 (Read/Write New, Remove References To Old)
    8. Drop The Old Representation Entirely

    In your specific case it could be that some of these can be merged. I’ve found that these steps are general enough to cover most scenarios safely. Though really, the best thing is to understand why so many steps are needed and whether the characteristics of your system impose different constraints.

  • PlantUML – UML Diagramming Tool

    This looks like a great set of tools to create different types of UML diagrams from simple text representations. Uses Graphviz for some diagram types.

    https://plantuml.com/

    Quoting their website

    PlantUML is a component that allows to quickly write:

    • Sequence diagram
    • Usecase diagram
    • Class diagram
    • Object diagram
    • Activity diagram (here is the legacy syntax)
    • Component diagram
    • Deployment diagram
    • State diagram
    • Timing diagram

    The following non-UML diagrams are also supported:

    • JSON data
    • YAML data
    • Network diagram (nwdiag)
    • Wireframe graphical interface (salt)
    • Archimate diagram
    • Specification and Description Language (SDL)
    • Ditaa diagram
    • Gantt diagram
    • MindMap diagram
    • Work Breakdown Structure diagram (WBS)
    • Mathematic with AsciiMath or JLaTeXMath notation
    • Entity Relationship diagram (IE/ER)
  • Grady Booch – A thread regarding the architecture of software-intensive systems.

    Quoting Twitter thread by @Grady_Booch on 4th of September 2020.

    There is more to the world of software-intensive systems than web-centric platforms at scale.
    A good architecture is characterized by crisp abstractions, a good separation of concerns, a clear distribution of responsibilities, and simplicity.

    All else is details.
    You cannot reduce the complexity of a software-intensive system; the best you can do is manage it.
    In the fullness of time, all vibrant architectures must evolve.

    Old software never dies; you must kill it.
    Some architectures are intentional, some are accidental, most are emergent.
    Meaningful architecture is a living, vibrant process of deliberation, design, and decision.
    The relentless accretion of code over days, months, years and even decades quickly turns every successful new project into a legacy one.
    Show me the organization of your team and I will show you the architecture of your system.
    All well-structured software-intensive systems are full of patterns.
    A software architect who does not code is like a cook who does not eat.
    Focusing on patterns and cross-cutting concerns can yield an architecture that is smaller, simpler, and more understandable.
    Design decisions encourage what a particular stakeholder can do as well as constrain what a stakeholder cannot.
    In the beginning, the architecture of a software-intensive system is a statement of vision. In the end, the architecture of every such system is a reflection of the billions upon billions of small and large, intentional and accidental design decisions made along the way.
    All architecture is design, but not all design is architecture.

    Architecture represents the set of significant design decisions that shape the form and the function of a system, where significant is measured by cost of change.

    https://threadreaderapp.com/thread/1301810358819069952.html
    https://twitter.com/grady_booch/status/1301810358819069952?s=21
  • Detecting if a point is inside or outside of a path

    @FreyaHolmer

    my favourite way to see if a point is inside or outside a path is using its winding number 🍥 traverse the path from the perspective of the point and add up the amount of turning along the way. if it made a full turn, it’s inside; if it wound back to 0, it’s outside. it’s so neat~
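    The idea translates directly to code: walk the path's edges, sum the signed angle each edge subtends at the query point, and check whether the total is a full turn. A Python sketch for simple closed polygons (function names are mine):

```python
import math

def winding_number(point, polygon):
    """Total signed turning of the polygon as seen from `point`, in turns."""
    px, py = point
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Angle swept moving from vertex i to vertex i+1, seen from the point.
        a1 = math.atan2(y1 - py, x1 - px)
        a2 = math.atan2(y2 - py, x2 - px)
        da = a2 - a1
        # Normalize to (-pi, pi] so each step turns the short way around.
        while da > math.pi:
            da -= 2 * math.pi
        while da <= -math.pi:
            da += 2 * math.pi
        total += da
    return round(total / (2 * math.pi))  # +/-1 = inside, 0 = outside

def is_inside(point, polygon):
    return winding_number(point, polygon) != 0
```

    For self-intersecting paths the winding number can exceed 1 in magnitude, which is exactly why it is more informative than a plain crossing test.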

  • CSV-related tools and resources – AwesomeCSV

    The URL: https://github.com/secretGeek/AwesomeCSV

    The following content was taken from GitHub, authored by Leon Bambrick.

    Awesome CSV Awesome

    A carefully curated list of CSV-related tools and resources

    CSV remains the most futuristic data format from the distant past.

    XML has risen and fallen. JSON is just a flash in the pan. YAML is a poisoned chalice. CSV will outlast them all.

    When the final cockroach breathes her last breath, her dying act will be to scratch her date of death in a CSV file for posterity.

    Contents

    Here are some awesome tools for dealing with CSV:

    Tools

    • NimbleText/Live – Use patterns to manipulate CSV; the world’s simplest code generator *.
    • PapaParse – A powerful in-browser CSV parser.
    • d3-dsv – d3.js parser and formatter module for delimiter-separated values.
    • CSVKit – CSV utilities that includes csvsql / csvgrep / csvstat and more.
    • XSV – A fast CSV command-line toolkit written in Rust.
    • sed (gnu tool) – Stream editor.
    • gawk (gnu tool) – Text processing and data extraction using awk.
    • awk by example – Comprehensive examples of using awk.
    • Miller – Like sed / awk / cut / join / sort etc for name-indexed data such as CSV.
    • ParaText – CSV parsing at 2.5 GB per second.
    • CSVGet – Get structured data from sites as CSV.
    • CSVfix – A tool for manipulating CSV data.
    • Tad – A fast free cross-platform CSV viewer.
    • Nvd3-tags – A tiny library for making charts from csv data.
    • Powershell: Import-CSV – Powerful in-built facility for dealing with CSV (example).
    • CSV Tools – A collection of useful CSV utilities.
    • graph-cli – Flexible command line tool to create graphs from CSV data.
    • CSV to SQL – Online tool to create insert/update/delete etc from CSV data.
    • C#: kbCSV – An efficient, easy to use .NET parsing and writing library for CSV.
    • csvprintf – UNIX command line utility for parsing and formatting output based on CSV files.
    • Mockaroo – Random data generator for CSV / JSON / SQL / Excel.
    • Ron’s CSV Editor – Handles big files, does miraculous things. A timeless editor for a timeless format.
    • Rainbow CSV plugins – Collection of text editor plugins for CSV/TSV syntax highlighting. Available for Vim, VS Code, Atom, Sublime Text and other editors.
    • Mighty Merge – join/union csv files.

    Repair or Validate CSV

    • Csvlint.go – Command line tool for validating CSV files against RFC 4180.
    • csvstudio – A smart app to repair syntax errors in very large CSV files.
    • scrubcsv – Remove bad records from a CSV file and normalize (requires rust)
    • reconcile-csv – Find relationships between a set of related CSVs

    Generate Table Schema

    • CSV Schema — Analyzes a CSV file and generates database table schema, all within the browser
    • Wanted: More tools in this category.

    Treat CSV as SQL

    • TextQL – Execute SQL against CSV or TSV.
    • Datasette Facets – Faceted browse and a JSON API for any CSV File or SQLite DB.
    • q – Run SQL Directly on CSV Files
    • RBQL – Rainbow Query Language, a SQL-like language with JavaScript or Python backend.
    • PSKit Query — Powershell module lets you run simple queries over objects, including imported with csv
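    The same "SQL over CSV" idea that q and TextQL implement can be sketched with nothing but the Python standard library: parse the CSV, load it into an in-memory SQLite table, and query it. The sample data is invented:

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on disk.
CSV_TEXT = """name,dept,salary
Alice,Eng,120
Bob,Sales,90
Carol,Eng,110
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO people VALUES (:name, :dept, :salary)", rows)

# Run SQL directly over the CSV contents.
for dept, avg in conn.execute(
    "SELECT dept, AVG(salary) FROM people GROUP BY dept ORDER BY dept"
):
    print(dept, avg)
```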

    Convert to or from CSV

    • CSV to Table – Convert CSV files to searchable and sortable HTML table.

    CSV <-> JSON

    • Agnes – Two way Csv to Json **.
    • csv2json – online tool to convert your CSV or TSV formatted data to JSON and vice versa.
    • csv-to-json – Easy, privacy-friendly and offline-first online csv to json converter.

    Essays

    Once you’ve found the perfect data serialization file format, you stop looking

    David Wengier

    Data

    Conferences

    • csv,conf – A community conference for data makers everywhere.

    Standards

    “The wonderful thing about standards is that there are so many of them to choose from.” – (Possibly) Grace Hopper.

    META: Other similar lists

    • structured-text-tools – List of command line tools for manipulating CSV / XML / HTML / JSON / INI etc.
    • META-META – This list as CSV.
    • META-META-META – A NimbleText pattern that produces this markdown page from this list as a CSV.

    Code of Conduct

    See Code of Conduct

    Funtribute

    To experience the fun of contributing, see Contributing

    Footnotes

    * I’m the author of NimbleText. Of course I put it first on the list. If I didn’t personally rate it I wouldn’t have spent so much time making and improving it.

    ** I wrote agnes but don’t really endorse it for others to use (thus haven’t migrated the source code to GitHub). It’s slow and non-streaming. I’d go with papa-parse. On the plus side, agnes has a more comprehensive test suite and simpler API than most.

    *** Mine too.

    License

    CC0

    To the extent possible under law, Leon Bambrick has waived all copyright and related or neighboring rights to this work.